ABSTRACT


The rise of generative AI has led many companies to hire freelanc- 
ers to harness its potential. However, this technology presents 
unique challenges to developers who have not previously engaged 
with it. Freelancers may find these challenges daunting due to the 
absence of organizational support and their reliance on positive 
client feedback. In a study involving 52 freelance developers, we 
identified multiple challenges associated with developing solu- 
tions based on generative AI. Freelancers often struggle with as- 
pects they perceive as unique to generative AI such as unpredict- 
ability of its output, the occurrence of hallucinations, and the in- 
consistent effort required due to trial-and-error prompting cycles. 
Further, the limitations of specific frameworks, such as token lim- 
its and long response times, add to the complexity. Hype-related 
issues, such as inflated client expectations and a rapidly evolving 
technological ecosystem, further exacerbate the difficulties. To ad- 
dress these issues, we propose Software Engineering for Genera- 
tive AI (SE4GenAI) and Hype-Induced Software Engineering 
(HypeSE) as areas where the software engineering community can 
provide effective guidance. This support is essential for freelancers 
working with generative AI and other emerging technologies. 
1 INTRODUCTION


Public interest in generative AI (GenAI) has recently surged. Nu- 
merous organizations are exploring the utility of large text-to-text 
and text-to-image foundation models, integrating them into their 
existing infrastructure or creating new applications. Small to mid- 
sized enterprises and startups, known for their agility, are prone 
to investigate new trends. Due to limited in-house development 
resources, they often depend on external freelancers who, con- 
tracted to deliver specific solutions, find themselves at the cutting 
edge of technological developments. Such freelancers take the role 
of explorers who open up new possibilities for their clients. 

GenAI has limitations that may hinder independent developers
working on applications. They urgently require guidance. The first 
step to providing effective support is to identify and understand 
the challenges they face when developing GenAI-based applications.
Given GenAI's rapid uptake and the significant role freelance
developers play in its adoption, we ask:


RQ: What challenges do freelancers experience when developing 
solutions based on generative AI? 


We explore the perspectives of freelance developers on GenAI- 
related projects through the analysis of 52 interviews and survey 
responses. Approximately 60% of these freelancers work on small- 
scale projects as the sole developer, 20% participate in medium- 
sized projects with up to five team members, and the remaining 
20% are either contractors in larger projects or operate a freelanc- 
ing agency. These freelancers identify 99 challenges across 11 
Software Engineering Body of Knowledge (SWEBOK) areas and 
54 subcategories. These software engineering (SE) challenges arise 
from the inherent characteristics of GenAI as a technology and 
the associated hype. We use the term ‘GenAI technology’ to refer
to the inherent features of GenAI as a computing paradigm, dif- 
ferentiating it from features of products that provide GenAI capa- 
bilities, such as large language models (LLMs) or foundation mod- 
els and their APIs. The ‘hype’ refers to the novelty and fashiona- 
bleness of GenAI among businesses. The characteristics of the par- 
adigm, its products, its novelty, and its trendiness magnify known 
challenges and create new ones, as perceived by the freelancers. 

These findings are relevant to the SE community. Some of the
issues identified by freelancers will likely impact the SE of GenAI- 
based applications for an extended period and, possibly, also be- 
yond the context of freelancing. Further, future technologies will 
keep emerging as hype, requiring practitioners and researchers to 
manage hype-related phenomena. SE research must address these 
aspects. Thus, this paper suggests two research areas: (1) Software 
Engineering for Generative AI (SE4GenAI) as an extension of SE 
for AI (SE4AI) discourse already present in SE, and (2) Hype- 
Induced Software Engineering (HypeSE) to handle challenges
associated with novel and fashionable technologies.

This manuscript contributes in three ways. First, it outlines 
challenges identified by freelancers who create GenAI-based
solutions, adding a new, freelance-oriented perspective to SE4AI
literature [15, 16, 37, 38, 42, 45, 57]. Second, it classifies these
challenges, highlighting the pivotal role of hype as a source of practical
SE challenges. This complements earlier discussions of hype in SE, 
particularly relating to blockchain [14, 41, 43]. Third, it suggests 
new directions for SE research. 


2 BACKGROUND 


This study draws from practical background on GenAI, an emerg- 
ing discourse on IT freelancers, SE4AI literature, and studies on 
the influence of technological hypes on development. 


2.1 Current state of GenAI 


In most general terms, GenAI embraces AI systems capable of 
generating new content, e.g., realistic images of people who never 
existed, based on examples [24]. Due to a stochastic layer, GenAI 
produces nondeterministic output¹, so one cannot predict what
image will emerge from a specific set of examples, which creates
the illusion of creativity [5, 30]. The broad uptake of GenAI is 
driven by the availability of foundation models, i.e., general-pur- 
pose models trained on extensive data sets that cover a wide range 
of general tasks and can be adapted for more specific tasks [8]. 
Progress in natural language processing (NLP) yielded foundation 
models which accept natural text as input - LLMs. Following the 
November 2022 launch of OpenAI's ChatGPT, offering a chat in- 
terface for a powerful LLM, interest in GenAI skyrocketed. The 
initial hype around GPT models extended to other products rely- 
ing on pretrained, externally hosted foundation models accessible 
through APIs [64]. Their rapid uptake captivated the scientific 
community, leading to numerous papers on potential implications 
for various sectors [18] and evaluating the models' suitability and 
quality for particular uses [7]. However, despite repeated calls, 
empirical data on GenAI's impact remains scarce.

An ecosystem of products has emerged around LLMs since late 
2022. Major vendors, including OpenAI and Google, offer access to
hosted, closed-source models. Other providers offer open-source 
models, hosted on cloud infrastructure provided by HuggingFace 
or Replicate. Users can also train or fine-tune their own models,
though this requires expensive computational resources. LLMs
can be accessed via dedicated APIs. Orchestration frameworks 
such as LangChain² or LlamaIndex³ bundle these APIs and assist
in pipelining functions. If an application depends on context data, 
these frameworks also support data preprocessing, storage, and 
management. A typical stack includes hardware or cloud resources,
an LLM, its API, and an optional orchestration framework and
databases [30, 34]. Developers use this stack to create applications,
typically in Python, which the major frameworks support. Plugins
for other languages and no-code platforms like Bubble.io make 
GenAI even more accessible. This ecosystem is highly dynamic. 
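
The layering described above can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions, not the API of any particular framework: `fake_llm`, `retrieve_context`, and `answer` are hypothetical names, the model call is a local stub standing in for a hosted LLM API, and retrieval is a toy word-overlap ranking rather than a real vector store.

```python
from dataclasses import dataclass
from typing import Callable, List


# Hypothetical stand-in for a hosted LLM API (assumption: no network call is made).
def fake_llm(prompt: str) -> str:
    return f"[model output for {len(prompt)} prompt characters]"


@dataclass
class Document:
    text: str


def retrieve_context(query: str, store: List[Document], k: int = 2) -> List[Document]:
    """Toy retrieval: rank stored documents by the number of words shared with the query."""
    query_words = set(query.lower().split())
    ranked = sorted(
        store,
        key=lambda d: len(query_words & set(d.text.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def answer(query: str, store: List[Document], llm: Callable[[str], str] = fake_llm) -> str:
    """Orchestration step: retrieve context, build the prompt, call the model."""
    context = "\n".join(d.text for d in retrieve_context(query, store))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return llm(prompt)


store = [
    Document("Invoices are due within 30 days."),
    Document("Support is available on weekdays."),
    Document("Refunds require a receipt."),
]
print(answer("When are invoices due?", store))
```

Passing the model as a callable keeps the orchestration logic testable without a hosted model, which is one plausible way to isolate the application from changes in the surrounding ecosystem.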

This brief summary highlights that the development of GenAI- 
based applications extends far beyond simply prompting a model. 
More specifically, it underscores that GenAI, as a technology, is 
distinguished by its ability to generate novel, unseen content from 
examples or instructions in a non-deterministic fashion. It also
refers to a variety of products, including models, APIs, and optional
frameworks that provide access to these capabilities. Systems
that leverage the unique capabilities of GenAI through these
products are referred to as GenAI-based applications in this manuscript.
The value of these applications is derived mostly from the
customization of foundational models’ capabilities for a specific
task.

1 This holds for most GenAI, including classic Generative Adversarial Networks
(GANs) and LLMs, but some approaches such as Wasserstein GANs are deterministic.

We still lack a clear understanding of the challenges that arise 
during the development of such applications, particularly when 
the GenAI components depend on pre-existing external models. 
Comprehensive reviews of the challenges associated with LLMs 
have only recently become available as preprints [e.g., 30, 34, 64]. 
They rely on analysis of grey and white literature showing that 
LLMs are subject to the following limitations: need for large
training data, tokenization problems, computational require- 
ments, resources necessary for fine-tuning, inference latency, lim- 
ited context length, model updating and refinement, bias, infor- 
mation hallucinations, lack of explainability, reasoning errors, 
risk of adversarial attacks or malign use, model behavior changes 
over time, as well as spelling/counting errors [30]. Further, results 
produced by LLMs cannot be reproduced even with decoding tem- 
perature set to zero [34]. While these technical issues and charac- 
teristics are widely confirmed, their effect on the work of inde- 
pendent developers remains unclear. Furthermore, the provided 
summaries overlook potential challenges that may arise in various 
subprocesses of SE when the LLMs are required to function as a 
component within a more complex system. 
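
To make one of these limitations concrete, the following sketch shows how limited context length can surface as an engineering concern once an LLM is embedded in a larger system. It is an illustration under loud assumptions: `count_tokens` and `fit_to_budget` are hypothetical helpers, and counting whitespace-separated words is only a crude proxy for a real tokenizer, whose counts will differ.

```python
from typing import List


def count_tokens(text: str) -> int:
    """Crude proxy (assumption): whitespace-separated words, not real model tokens."""
    return len(text.split())


def fit_to_budget(chunks: List[str], budget: int) -> List[str]:
    """Keep chunks in order until adding the next one would exceed the budget."""
    kept: List[str] = []
    used = 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > budget:
            break
        kept.append(chunk)
        used += cost
    return kept


chunks = ["alpha beta gamma", "delta epsilon", "zeta eta theta iota"]
print(fit_to_budget(chunks, budget=5))
```

In this sketch, context that does not fit is silently dropped; a real application would have to decide whether to truncate, summarize, or reject such input.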

Natural-language prompting has become a central interaction 
mechanism with text-to-image and text-to-text foundation mod- 
els. Prompts are instructions given to LLMs to make them follow 
instructions or standardize the output. They are the primary ap- 
proach for users to control models’ behavior. Prompt engineering 
has swiftly become a buzzword to describe the activity of crafting 
prompts for specific tasks [11, 63]. Scientific literature has started
to depict the activities involved in prompt engineering, leading to
nascent guidance or development of support tools [3, 54, 61-63]. Despite
an emerging body of patterns and best practices, prompt engineer- 
ing involves a significant amount of trial-and-error cycles with the 
success depending on the intuition of the prompter rather than 
systematic guidance [30, 63]. This is further amplified by prompt 
brittleness, i.e., unintuitive, large changes in the model output
occurring due to very small changes in the prompt, like replacing
words with synonyms or changing the order of requests [34]. 
Whereas those and other challenges mentioned above are docu- 
mented in public discourse and in the most recent studies, we lack 
empirical data on how software engineers perceive those aspects 
and how the SE challenges of GenAI compare to the challenges 
known from non-generative machine learning (ML) technologies. 
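
Prompt brittleness, as described above, can in principle be probed by running semantically equivalent prompt variants and comparing the outputs. The sketch below is one possible way to do so, not a method taken from the cited studies: `pairwise_similarity`, `brittleness_probe`, and `stub_llm` are hypothetical names, the canned replies are invented for illustration, and character-level similarity from Python's `difflib` is only a rough proxy for semantic agreement.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import Callable, Dict, List


def pairwise_similarity(outputs: List[str]) -> float:
    """Mean character-level similarity (0..1) over all pairs of outputs."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 1.0
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)


def brittleness_probe(variants: List[str], llm: Callable[[str], str]) -> float:
    """Run equivalent prompt variants through the model and score output agreement."""
    return pairwise_similarity([llm(v) for v in variants])


# Hypothetical stub: invented canned replies standing in for a real model call.
CANNED: Dict[str, str] = {
    "List three risks of the project.": "- cost\n- delay\n- scope",
    "List three dangers of the project.": "The project may face cost overruns.",
}


def stub_llm(prompt: str) -> str:
    return CANNED[prompt]


score = brittleness_probe(list(CANNED), stub_llm)
print(f"agreement across variants: {score:.2f}")
```

A low agreement score for variants that differ only by a synonym would flag a prompt as brittle and worth revisiting before delivery.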

The impact of GenAI on the software (SW) industry is currently a
hot topic. Yet most conversations revolve around the pros and cons
of using LLMs for code generation [13, 46, 61, 62]. Discussion
of their use in applications remains a niche discourse
limited, mostly, to specific application areas [52]. A proposed
GenAI design process involves problem definition, model selec- 
tion or model training, adaptation and alignment cycle (involving 
prompt engineering, fine-tuning, human alignment, and evalua- 
tion), and optimization and deployment [34, 64]. Although accu- 
rate, this advice is rather broad and lacks specific insights on how 
to tackle challenges unique to each phase. Moreover, to create
value, GenAI technologies need to be integrated into either existing
or new applications. This integration process encompasses
standard SE steps, as well as design activities specific to GenAI.
This could potentially complicate the development of GenAI-based
applications, thereby introducing new SE challenges. This
study aims to investigate these challenges from the viewpoint of
freelance professionals.

2 python.langchain.com, retrieved July 23, 2023
3 gpt-index.readthedocs.io/en/latest/, retrieved July 23, 2023


2.2 Freelance developers 


Freelancers, independent developers, and small agencies are ac- 
tively engaging in the latest GenAI trends. A significant number 
of trending GitHub repositories related to GenAI are managed by 
individuals, not large tech corporations. Also, LLM orchestration 
frameworks such as LangChain and LlamaIndex are an outcome
of independent and startup projects. Conversely, many client 
companies, lacking their own IT personnel and resources, are un- 
able to delve into the potential of GenAI independently. As a re- 
sult, they often turn to freelancers, viewing them as a cost-effec- 
tive alternative to professional IT firms. This scenario places free- 
lancers in a crucial role within the GenAI ecosystem. 

Freelance developers, while integral to the IT industry, face nu- 
merous specific challenges such as inconsistent income, reliance 
on past clients’ reviews, variable competition, evolving skill re- 
quirements, and dependency on freelancing platforms [28, 29]. 
They are often the first to be affected by shifts in trends and hypes 
[29], yet they play a pivotal role in the global delivery of IT devel- 
opment services and digital transformation [25, 26]. Existing re- 
search on freelancers in IT tends to take a broad approach, often 
overlooking the specific activities or project types these profes- 
sionals engage in [26]. The subject of freelance developers is sel- 
dom addressed in SE research, and when it is, it is usually in the 
context of crowdsourced software development [1, 39]. Neverthe- 
less, freelancers are increasingly serving as independent develop- 
ers who can be directly contracted by client companies via free- 
lancing platforms [28], handling more complex tasks than mere 
sporadic assistance [39]. We posit that the significance of freelanc- 
ers will continue to grow with the advent of code generation mod- 
els, enabling a single individual to efficiently develop an applica- 
tion or software module independently. However, this potential 
can only be fully realized if clients understand freelancers’ view- 
points on emerging trends, thereby facilitating more effective col- 
laboration. This has motivated our focus on freelance developers. 


2.3 Development of AI-based applications


SE for AI (SE4AI) and SE for ML (SE4ML) form an expanding research
domain that underscores the distinctive characteristics of
AI-based development in contrast to conventional SW projects.
Various studies [6, 15, 16, 31, 37, 38, 42, 57, 59] have indicated that 
uncertainty factors such as the probabilistic aspect of ML and de- 
pendency on big data present a multitude of issues for developers. 
A range of meta-studies summarizes the findings [23, 40, 42, 45]. 
Multiple studies attended to those challenges. Meanwhile, several 
comprehensive meta-reviews emerged listing and sorting the 
challenges according to ML process steps [47], SWEBOK 
knowledge areas [23, 42], or newly developed categories [44]. 
Many of the challenges listed in those studies overlap. Never- 
theless, we also observed subtle differences in how they frame the 
challenges or how they order them. Given this observation, we 
combined four influential refereed meta-reviews published
between 2021 and 2023 [23, 42, 44, 47] to obtain good coverage
of the various challenges reported in the literature. We identified 137
distinct challenges, provided in the supplementary material [17]⁴. Table 2
presents a selection of those challenges. To summarize, the
main challenges stem from inadequate or absent data, the necessity to
engage in training as the basic mode of creating AI capabilities, and
the unpredictable and inherently complex nature of ML-based AI,
as evidenced by the changing anything changes everything (CACE)
principle and models’ opaqueness. The literature identifies a lack of
adequate guidance and process models to address those challenges,
as well as problems with the integration of AI into end-user
applications due to downstream compatibility problems caused by
unpredictable or difficult-to-control output from AI components.

4 The supplementary material is available under https://doi.org/10.17605/osf.io/njc25

As explained above, GenAI contributes further technical challenges,
which the community has only just started to understand and
describe. This paper takes the perspective of freelance developers
to understand which challenges they perceive as particularly important
and how these compare to SE4AI challenges. However, we speculate
that technology is not the only source of challenges related to
GenAI that freelancers need to deal with.


2.4 Technological hype and development 


Despite the ongoing hype around GenAI [55] and other technolo- 
gies, we found limited literature on SW development for hyped 
technologies and no recent comprehensive reviews. To provide 
background, we gathered literature from various fields, but the 
fragmented knowledge highlights the need for more research. 

Hype occurs when a novel subject is heavily promoted across 
media platforms like newspapers, TV, and social networks to gar- 
ner attention [10]. Since 1995, Gartner Inc., a consulting firm, has 
measured hype as fluctuating expectations or visibility over time, 
marking stages such as innovation trigger, peak of inflated expec- 
tations, trough of disillusionment, slope of enlightenment, and 
plateau of productivity [21, 53]. Typically, hype is used in relation 
to technologies reaching the peak of inflated expectations. Such 
technologies become fashionable, i.e., popular and socially demanded
at this specific point in time [51]. Considering this discussion,
we suggest defining technological hype as follows: A technology
experiences hype when it possesses inherent novelty (or includes
novel features), thereby making it fashionable within business
circles and society.

Hypes come and go. Various technologies gained attention in 
the past two decades. SE literature specifically attended to the im- 
plications of blockchain hype. The blockchain hype peaked in 
2016-2017 [22]. With many platforms offering blockchain services, 
the focus of SE literature was on developing reliable smart con- 
tracts. However, traditional SW development life cycle models 
proved inadequate due to the immutability of smart contracts and 
the reliance on expert reviews over code testing [14, 43]. This trig- 
gered development of new techniques for Blockchain-Oriented SE 
to accommodate the inherent nature of this technology [14,
49]. The hype around blockchain, rather than being solely negative,
became an opportunity to problematize and improve SE practice
and theory, as pointed out by some researchers [12, 41]. Similarly,
the hype around ML also garnered SE attention, yielding the insights
listed above [12]. The concept of hype as a topic has started
entering SE discourse. Yet, it remains unclear how the fact that a
technology is subject to hype impacts practitioners’ work.

Agile development and its extensions, like MLOps or DevOps, 
were proposed to manage the novelty of technology and changing 
context. They became a standard across many companies [2]. Agile
assumes that development can be divided into manageable
parts, conflicting with the CACE principle. It also disregards the 
impact of hype, which significantly influences requirements. De- 
spite these issues, agile remains influential in shaping SE practice. 

Hype around technology often increases demand for related 
solutions and prototypes. Driven by media coverage and success 
stories, companies seek to leverage these new technologies. This 
mirrors solution-based probing, where a technology is field-tested 
[9]. However, companies aim not just to learn, but also to create 
business value. The increased demand significantly affects the 
workforce, potentially leading to new developer profiles. 

Two phenomena emerge due to this dynamic. The first, resume- 
driven development, occurs when firms and developers include 
trending technologies in job listings and resumes. This practice 
can foster harmful dynamics, compromise team reliability, and 
stress both employees and employers [19, 20]. The second, hype- 
driven development, often discussed by practitioners but not re- 
searchers [4, 33], occurs when developers use hyped technology 
without considering its architectural implications or integration 
effort. This particularly affects independent and agile developers. 

Despite the recognition of hype's impact on SE practice, there 
is still limited research and guidance for freelancers navigating the 
technological opportunities and risks. Given the rapid pace of 
technological advancements, this topic warrants more attention. 
Future hypes are inevitable. In particular, the specific challenges
that arise when a technology like GenAI becomes a major hype
remain unclear. This study addresses this gap by analyzing challenges
related to GenAI and discussing the role of hype in them.


3 METHOD 


This qualitative study identifies challenges reported by freelancers 
in GenAI projects. It uses a survey and interviews with 52 active
freelance developers. The study was approved by the University's 
Institutional Review Board in May 2023. We collected data in May 
and June 2023 and analyzed it in June and July 2023. The study 
compares those issues against SE4AI / SE4ML challenges reported 
in earlier studies. We provide data concerning the method em- 
ployed and the full list of challenges identified in SE4AI meta-re- 
views in supplementary material. 


3.1 Literature Study 


To compile a comprehensive, current list of challenges reported in 
SE4AI and SE4ML literature, we referred to four extensive meta- 
reviews [23, 42, 44, 47]. These were published in recognized, peer- 
reviewed outlets from 2021 to 2023, summarizing challenges related
to AI-based application development reported by empirical
studies. We extracted individual challenges from each meta-re- 
view, not only from tables or graphics but also from textual de- 
scriptions for a more detailed understanding. After removing du- 
plicates, we identified 276 unique challenges with slight differ- 
ences. We then grouped these challenges into 11 Knowledge Ar- 
eas, including a new one, Training and Testing Data, based on the 
reviews [23, 42, 44]. Within these Knowledge Areas, we catego- 
rized the challenges into 54 groups, and after merging near-dupli- 
cates, we ended up with 137 challenges (see Table 2 and supple- 
mentary material [17]). These serve as a comparison basis with 
challenges mentioned by freelancers interviewed in this study. 


3.2 Participants 


Dolata et al. 


We selected freelancers via Upwork, a freelancing platform. On 
May 1, we posted a job description offering $35 for a 10-minute 
survey and a 45-to-60-minute interview. We invited 450 freelanc- 
ers (300 in May, 150 in June) who met specific criteria, including 
having a profile created before 2023, explicitly mentioning GenAI 
(‘GPT’, ‘LLM’, etc.) in their profile, and showing prior experience 
with the GenAI platforms and APIs. Following the Upwork process,
interested freelancers needed to submit a ‘proposal’ which,
in our case, consisted of answers to four questions about their
experience, GenAI knowledge, project history, and education. We
received 96 answers. Some applicants had one or no GenAI projects
whereas we required at least two, provided nonsensical answers (e.g.,
empty ones), or proposed significantly higher rates, e.g., $1000.
After filtering, we were left with 81 potential participants.

These individuals received a formal Upwork contract offer and 
survey invitation. However, 24 did not accept the offer or take the 
survey. Five participants withdrew from the study or became un- 
responsive after accepting the offer or filling out the survey. We 
excluded their responses from our analysis. Ultimately, 52 free- 
lancers participated in the study. 
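
The recruitment funnel described above can be checked with simple arithmetic. The sketch below uses only the counts reported in this section; the variable names are ours, and the two rates printed at the end are derived here for illustration and are not reported in the study.

```python
# Recruitment funnel, using only the counts reported in Section 3.2.
invited = 300 + 150   # invitations sent in May and June
proposals = 96        # answers received
shortlisted = 81      # potential participants left after filtering
no_show = 24          # did not accept the offer or take the survey
withdrew = 5          # withdrew or became unresponsive

filtered_out = proposals - shortlisted
participants = shortlisted - no_show - withdrew

print(f"invited: {invited}")
print(f"filtered out: {filtered_out}")
print(f"participants: {participants}")

# Derived rates (not reported in the source, shown for illustration only).
print(f"proposals per invitation: {proposals / invited:.1%}")
print(f"participants per shortlisted freelancer: {participants / shortlisted:.1%}")
```

The computed number of participants matches the 52 freelancers reported above.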

Our participant group was diverse, as detailed in Table 1. The 
mean age was 32.6 years. On average, respondents had participated
in 4.8 GenAI-related projects, a reasonable figure considering
GenAI's broad uptake began in late 2022. Their average experience
as developers was 6.9 years, whether as freelancers or in other
settings. Most held a bachelor's or master's degree, with 36 specializing
in informatics-related subjects. They reported using 14
different models/platforms, with 52 using GPT and 21 using
DALL-E. They came from 23 countries across all inhabited continents.
Unfortunately, gender diversity was low, with only two female
participants despite our efforts to attract more.


Table 1. Descriptive statistics regarding study participants

Attribute | Values and frequency among the participants
Age (years) | average: 32.6 years old, median: 31, max: 55, min: 19; <25: 11, 25-29: 5, 30-34: 17, 35-39: 11, 40-49: 5, ≥50: 3
Country of residence | overall, 23 different countries including USA: 7, Pakistan: 6, India: 5, Nigeria: 4, Serbia: 4, Other: 26
No. GenAI projects | average: 4.8 projects on GenAI, median: 4, max: 15, min: 2; 2 projects: 5, 3: 11, 4: 11, 5: 13, 6: 4, 7: 4, 8: 1, ≥10: 3
Developer experience | average: 6.9 years of experience, median: 5, max: 20, min: 1; <2 years: 8, 3-4: 10, 5-6: 13, 7-8: 5, 8-10: 8, ≥10: 8
Degree | PhD: 6, MSc/MA: 21, BSc/BA: 21, Other: 1, BSc-Student: 3
Subject studied | overall, 20 distinct study subjects including Computer Science: 18, ML / AI / NLP: 5, SE: 4, Electrical Engineering: 4, Physics: 3, Telecommunications: 2, Economics: 2, Geography: 2, IT Management/IS: 2
GenAI platforms used | GPT: 52, DALL-E: 21, Stable Diffusion: 18, Midjourney: 17, LLaMA: 13, LaMDA/Bard: 6, Vicuna: 5, Others: 11; four participants mention fine-tuned or trained custom models


3.3 Data Collection 


We employed three data collection methods. Initially, participants
provided basic demographic details with their proposal. After
contract acceptance and study consent, they completed a survey
on 2-5 chosen projects and on the main challenges and rewards
of GenAI application development. Then, they arranged an interview.
The first author conducted these interviews, based on the
problem-centered interview paradigm [58]. The data collected in 
the proposal and survey provided background for the exploration 
of the problems. The interview commenced with an introduction and
project background, followed by interviewees narrating their
journey to becoming GenAI freelancers and discussing one or two
selected projects that they were particularly proud of, or found
challenging or intriguing. The elicited narratives were enhanced
by dialogues featuring semi-structured questions related to freelancers'
positive and negative experiences with GenAI, comparisons
between earlier and GenAI-based projects, project uncertainties, challenges,
enjoyable aspects of working with GenAI, and their views
on freelancing. The sequence of questions was adapted to the dialogue's
flow. During the interview, the interviewer referred
back to challenges and projects participants described in the survey.
The interviewer took notes to revisit crucial points from the
narratives later in the interview.

The interviews were conducted via Zoom, averaging 55.7 
minutes each (median: 55, min: 42, max: 81). All were in English 
and transcribed verbatim, then anonymized. The connection be- 
tween demographic and other data was only possible through a 
token, ensuring privacy and security standards. We achieved data
saturation: the last 10 interviews did not introduce any significant
new insights. We provide interview guides and other material related
to data collection in the supplementary material.


3.4 Data Analysis 


The data analysis process was iterative and combined thematic
analysis during the initial coding, followed by an interpretive analysis
step during workshops [50, 56]. The second author conducted
qualitative coding of the interview data in a bottom-up manner to 
avoid potential biases which could have emerged in the first au- 
thor during interviews. He identified problems or issues, coding 
268 segments as challenges. This process was supervised by all au- 
thors, with edge cases debated among them. After initial coding, 
two interpretation workshops were held. For a granular overview 
of challenges, they were distributed according to the SWEBOK 
Knowledge Areas [42] identified in the literature analysis of SE4AI 
meta-reviews (see Table 2 and [17]). We obtained multiple over- 
lapping statements for all categories (except Training and Testing 
Data) confirming the saturation. 

During interpretation workshops, the authors observed that 
freelancers differentiated between challenges that occurred be- 
cause of the GenAI technology and because of the hype around it. 
For instance, freelancers predicted that some challenges might re- 
solve at the end of the hype. Others were claimed to result from 
the attributes of GenAI as technology. After further exploration,
the authors were able to discriminate between challenges that were
associated with the novelty and fashionableness of GenAI as two
dimensions of hype, and paradigm and product as two dimensions
of technology. All authors agreed that this differentiation contrib- 
utes to the understanding of collected insights. Supplementary 
material [17] provides further information about the coding 
schema and the freelancers quoted below. 


4 RESULTS 


Freelancers perceive multiple challenges distributed across 11 KAs 
and 54 categories as presented in Table 2. We count at least 99 
distinct challenges related to development of applications based 
on GenAI of different granularity. Importantly, rather than iden- 
tifying objective challenges, we focus on freelancers’ perceptions. 
We want to understand what they consider unique, amplified, or
reinforced due to the technological aspects of GenAI or the hype.
This focus allows us to produce guidance and identify development
areas which address specific needs of this professional group. We
first present the findings along the dimensions of technology and
hype. In 4.5, we summarize the findings and present Table 2, which
systematizes the findings.

5 We use anonymous codes of the participants followed by their age and years of
experience as developers to indicate the authors of the statements (xx-<age>-


4.1 GenAI Technology as Source of Challenges 


Characteristics of GenAI Paradigm


Freelancers associate many challenges with the unique nature of
GenAI. While some issues may be common to AI in general,
GenAI's specific characteristics and accessibility create unique obstacles.
LLMs are trained on vast datasets to produce high-quality,
versatile outputs. However, these characteristics make it impossible
to trace and explain the model's behavior or identify the
sources of a specific, problematic output. Freelancers indicate that
this issue is exacerbated by many LLM providers' lack of transparency
about training data sources and fine-tuning processes. For
instance, it is unclear whether the providers used manual correction
for specific use cases. This uncertainty affects both the content
and form of the model's response. Freelancers say they can
control the output’s content by limiting the model to provided context
data in retrieval-augmented generation tasks, which rely on processing
contextual information delivered to the model together
with the prompt. Yet the response can still vary in tone, style, and
structure. For instance, it might have an inconsistent output format
(bullet list vs. flow text). Measures to limit variability, such as
GPT's ‘temperature’ parameter, are probabilistic, so freelancers
report observing different output for the same prompt even if
the temperature is set to the minimum. The opaque relationship between
a prompt and the output requires freelancers to invest significant
time in a trial-and-error process for crafting prompts:


I guess one of the most time-consuming things is to make it work. AI is 
a bit random and prompt engineering quite takes some time. And at that 
moment when you're like, ‘okay, it works’, some unstudied questions or 
something, you didn’t try, [comes up] and you suddenly figure out that 
it doesn't produce results you wanted. (va-19-3) 
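
The probabilistic character of temperature-based sampling can be illustrated with a toy example. The sketch below is not a provider's implementation; the scores in `logits` and the function names are hypothetical and serve only to show that a low temperature sharpens the distribution over candidate tokens without making the draw deterministic.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into sampling probabilities, scaled by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


def sample_token(logits, temperature, rng):
    """Draw one token index at random according to the scaled probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


# Hypothetical scores for three candidate tokens (assumption for illustration).
logits = [2.0, 1.8, 0.5]
cold = softmax_with_temperature(logits, 0.2)  # low temperature
warm = softmax_with_temperature(logits, 1.0)  # higher temperature

# Repeated draws at low temperature still yield more than one token.
rng = random.Random(0)
draws = [sample_token(logits, 0.2, rng) for _ in range(1000)]
```

In this toy setting, the most likely token gains probability mass as the temperature drops, yet the second candidate keeps a non-zero chance of being drawn, which mirrors the varying outputs freelancers describe.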


Hallucinations are a significant issue discussed by many
freelancers. When prompted, GenAI models will attempt to generate
an answer, even without relevant knowledge about the topic. This 
can result in incorrect or nonsensical text or images. Some free- 
lancers feel uncomfortable delivering products without a guaran- 
tee of consistent, correct behavior. A seasoned freelancer explains 
that his dedication to high-quality products makes him the slowest 
part of the development process: 


Well, the hallucinations, if the AI starts going off in creating information 
because one asked for it (...). Sometimes it's amusing, but other times it's 
like, ‘Oh my God! (...) It's lies. It's things that it's made up. There's no 
truthful basis to [check it] it, especially involving people. (...) So, I'm the 
slowest part in the process because I don't trust the AI. (...) So, I have to 
read everything and validate what it's doing. (...) Because, if you just 
take the output of the AI and trust it, it's going to bite you. (cl-55-15) 


Output hallucinations necessitate the establishment of
guardrails, which may include filters or temperature reduction.
However, this can limit GenAI's creativity, which is desired for
some use cases. A freelancer suggests that safety measures may
conflict with client expectations, especially if the latter are
unaware of GenAI's potential for hallucination:


<experience>). For a detailed information about the participants, including the au- 
thors of those statements, see supplemental material (2.1.4 — 2.1.5). 


Authors’ Manuscript accepted for ICSE ’24, April 14-20, Lisbon 


[There is] the idea of safety and kind of guardrails. (...) It's kind of a 
push-and-pull with certain clients. Because they say, ‘oh, the text looks 
boring’. And, in my mind, I'm thinking, ‘yeah, but it's better than the 
text looking crazy’. (...) Just because everything's looking good right now, 
doesn't mean things can't go off the edge. And, maybe, clients don't un- 
derstand that, because, if you're doing a good job, they're never seeing it 
go off the edge. So, they don't know that the edge is there. (ev-38-13) 


Freelancers mention further GenAI-related challenges that appear
during SW construction. Retrieval augmented generation
applications rely on clients’ data which provide context for LLMs 
to generate the answer. The intricate nature of LLMs, coupled with 
complex architecture and data, makes it exceedingly difficult for 
developers to identify the cause of incorrect or imprecise re- 
sponses. A developer describes this issue as follows: 


For me, it's always about the data. (...) You get the data, you pump it in 
Pinecone and all that, and then start querying that. You realize at some 
point whatever the bot is giving you back is not as accurate as you 
wanted it to be. (...) Now you start asking yourself: is it the problem with 
the system, the workflow, the Pinecone, and querying setup problem? Is 
it a little problem? Or now you have to start thinking what could be the 
problem? So, for me, the data is the first step. (do-31-5) 
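
One way to narrow down where such a pipeline fails is to inspect the retrieval step separately from the generation step. The following minimal sketch assumes a toy bag-of-words similarity in place of a real vector database; it does not use Pinecone or any provider API, and all names (`retrieve`, `build_prompt`, the sample documents) are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lower-case the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, documents, top_k=2):
    """Return the top_k (score, document) pairs so the scores can be inspected."""
    query = bag_of_words(question)
    scored = [(cosine(query, bag_of_words(doc)), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


def build_prompt(question, retrieved):
    """Assemble the prompt that would be sent to the model."""
    context = "\n".join(f"- {doc}" for _, doc in retrieved)
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}"
    )


# Hypothetical client data (assumption for illustration).
documents = [
    "Invoices are payable within 30 days of receipt.",
    "The office is closed on public holidays.",
    "Late payment of invoices incurs a 2 percent fee.",
]
question = "When are invoices payable?"
hits = retrieve(question, documents)
prompt = build_prompt(question, hits)
```

If the relevant passage is missing from `hits`, the data or the retrieval setup is the more likely culprit; if it is present but the answer is still wrong, the prompt or the model deserves attention first.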


One of the consequences of these challenges is the emergence 
of downstream compatibility issues, which many freelancers find 
novel. The lack of control over the output, including its format, 
complicates software design. Freelancers must accommodate
variability through architectural decisions and focus on the
interdependencies between components. This situation exacerbates
reproducibility issues, rendering software testing less reliable and
quality assurance more difficult. Furthermore, freelancers observe 
that large language models (LLMs) provide a set of interconnected 
advanced capabilities (e.g., evaluating and responding to a letter), 
making modularization unachievable. This complexity is a depar- 
ture from earlier AI models, which were trained for individual ca- 
pabilities. The opacity and scale of LLMs and other general AI 
technologies, combined with the absence of explanatory mecha- 
nisms, heighten the necessity for a trial-and-error approach. 
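
Accommodating variable output formats often means placing a normalization layer between the model and downstream components. The sketch below is one possible approach under simple assumptions: the model answers either with a bullet or numbered list or with flowing text, and downstream code expects a list of items. The function name `normalize_items` is hypothetical.

```python
import re

# Matches a leading bullet ("-", "*", "•") or a number such as "1." or "2)".
BULLET = re.compile(r"^\s*(?:[-*\u2022]|\d+[.)])\s+")


def normalize_items(raw):
    """Return a list of items whether the model answered with bullets or prose."""
    lines = [ln.strip() for ln in raw.splitlines() if ln.strip()]
    if lines and all(BULLET.match(ln) for ln in lines):
        return [BULLET.sub("", ln).strip() for ln in lines]
    # Fallback: treat the answer as flowing text and split on sentence ends.
    text = " ".join(lines)
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [p.strip().rstrip(".") for p in parts if p.strip()]


bulleted = normalize_items("- apples\n- pears")
numbered = normalize_items("1. apples\n2) pears")
prose = normalize_items("Buy apples. Buy pears.")
```

Such a layer does not remove the variability; it only confines its effects to one component, which is the kind of architectural decision the freelancers describe.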

Some interviewees consider training or fine-tuning a potential
solution to some of the problems. Yet, they link the GenAI paradigm
with increased expenses, high computational demands, and the need
for large volumes of data beyond what was needed for earlier ML
models. Only one interviewee mentioned having trained a model and
reported a cost of more than 10,000 USD for a single training
round. For most freelancers, such expenses are beyond the project
budgets. This is a new aspect for those who worked with ML before
and were used to training and retraining smaller models.

However, freelancers also indicate that the paradigm of available
LLMs has lessened data and domain dependency. They speak of
LLMs being able to deal with various data formats and structures 
for retrieval augmented generation. Also, they emphasize that 
progress they achieved in one project (e.g., prompt patterns) can 
be transferred easily to a project in a different domain, thanks to 
the generality of LLMs. This reduces pressure on SW construction 
and is ideal for freelancers who frequently move between projects. 


Characteristics of GenAI Products

Many providers offer GenAI as a product or service, supplemented
with APIs. This is the primary delivery method for LLMs. Although
these models offer easy access and a range of tools, freelancers
report that they can also be unpredictably expensive. Providers
adopt a pay-per-use strategy, where the cost per request is
determined by the length of the prompt or response. This makes it
challenging to estimate running costs for applications,
particularly when user inputs are factored into the original prompt
designed by the developer. This uncertainty also complicates
budgeting for developers, as they struggle to gauge the economic
resources required for the trial-and-error process and subsequent
application testing. This complicates SW configuration management
and SE management. One freelancer elaborates on these budgeting
challenges:


You have a budget limit. So, you can't spend all the client's budget by 
doing a lot of prompts. You have to write it and to check it, but you don't 
have a lot of opportunities to rewrite, especially for big data. It's also a 
very difficult question and difficult issue because (...) if you if you have 
a coin and have 50 times tails, it doesn't mean that after the next one, 
you will get a tail. So, it's the same with GPT results because if you a 
dataset with 10 million [documents] and you will check manually 100 
results, it doesn't mean everything was okay. (...) It's a kind of tradeoff 
because there are the limits, the time, and the money. And you can spend 
one of them, but you always have to check on the other. (ab-38-2) 
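
The pay-per-use logic described above can be made concrete with a small calculation. The prices and token counts below are hypothetical and do not reflect any provider's actual price list; the sketch only shows how the unpredictable user input and the number of trial runs enter the estimate.

```python
def request_cost(prompt_tokens, response_tokens, usd_per_1k_in, usd_per_1k_out):
    """Cost of one request when pricing depends on prompt and response length."""
    return (prompt_tokens / 1000) * usd_per_1k_in + (response_tokens / 1000) * usd_per_1k_out


def experiment_budget(runs, prompt_tokens, response_tokens, usd_per_1k_in, usd_per_1k_out):
    """Cost of repeating the same request during trial-and-error prompting."""
    return runs * request_cost(prompt_tokens, response_tokens, usd_per_1k_in, usd_per_1k_out)


# Hypothetical prices (USD per 1,000 tokens) and sizes, for illustration only.
PRICE_IN, PRICE_OUT = 0.03, 0.06
template_tokens = 500      # the developer's prompt template
user_input_tokens = 1000   # the part supplied by the end user, hard to predict
response_tokens = 500

single = request_cost(template_tokens + user_input_tokens, response_tokens, PRICE_IN, PRICE_OUT)
budget = experiment_budget(200, template_tokens + user_input_tokens, response_tokens, PRICE_IN, PRICE_OUT)
```

Under these assumed figures, one request costs about 0.075 USD and 200 trial runs about 15 USD; the estimate shifts with every change in user input length, which is the source of uncertainty the freelancers report.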


Further, freelancers identify unique technical constraints in 
LLMs which impact SW construction and maintenance. A commonly
cited issue is the size limit of the prompt or context window.
There are strict restrictions on the number of tokens a single 
prompt can deliver (for example, 16,000 for GPT4), which might 
be insufficient. A freelancer who designed a search and summary 
engine for a large dataset explains this as follows: 


It is very important to find knowledge in the huge amount of data and 
you can take, for example, 1010 pages of a form with needed data and 
summarization for these ten pages. But for now, GPT can utilize only 
two pages and it is the limit. (eu-35-15) 
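
A common workaround for such limits is to split the input into pieces that fit the window and to summarize in two passes. The sketch below is an illustration under strong assumptions: tokens are approximated by whitespace-separated words (real tokenizers count differently), and `summarize` is a hypothetical callable standing in for a model call.

```python
def chunk_words(text, max_tokens):
    """Split text into chunks of at most max_tokens whitespace-separated words."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    words = text.split()
    return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]


def summarize_long_text(text, max_tokens, summarize):
    """Summarize each chunk, then summarize the concatenated partial summaries."""
    partial = [summarize(chunk) for chunk in chunk_words(text, max_tokens)]
    # Note: the joined partial summaries may themselves exceed the limit;
    # a production version would have to repeat the procedure in that case.
    return summarize(" ".join(partial))


# Stand-in for a model call: keeps only the first word (assumption for the demo).
def first_word(text):
    return text.split()[0]


chunks = chunk_words("a b c d e", 2)
summary = summarize_long_text("a b c d e", 2, first_word)
```

Each additional pass adds requests, so this workaround trades the token limit for higher cost and latency.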


Software maintenance is hindered by the need to frequently 
adapt to model changes, a factor beyond the control of freelancers 
who depend on proprietary, closed-source models. Providers often 
update these models without transparency, leading to daily per- 
formance changes. This raises concerns about the consistency of 
model outputs, the predictability of application usefulness, and 
the need for ongoing re-validation of prompts and applications. 
This uncertainty also limits freelancers’ project selection as they
prefer to undertake projects they can successfully deliver to
garner positive client feedback and improve their completion rate. To
mitigate this, freelancers engage in preselection activities like 
testing different prompts or a subset of client data before commit- 
ting. However, their preselection decisions can be invalidated 
within days due to model changes, posing challenges due to in- 
creased dependency on external providers. Freelancers also worry 
about quality assurance, as they lack information on whether the 
external models are quality-assured, particularly in terms of secu- 
rity or privacy. The absence of this information from providers 
leaves freelancers’ clients with doubts. 
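
The ongoing re-validation of prompts mentioned above can be approximated by a small regression suite that is re-run whenever the external model may have changed. The sketch below assumes that `model` is any callable mapping a prompt to a text response; the two stub models and the single check are hypothetical and only simulate an unannounced change in output format.

```python
import json


def revalidate(model, cases):
    """Run each stored prompt and collect the names of checks that now fail."""
    failures = []
    for name, prompt, check in cases:
        try:
            passed = bool(check(model(prompt)))
        except Exception:
            # A check that raises (e.g., on unparsable output) counts as failed.
            passed = False
        if not passed:
            failures.append(name)
    return failures


def has_sentiment_key(output):
    """Structural check: the output must be JSON with a 'sentiment' field."""
    return "sentiment" in json.loads(output)


cases = [
    ("sentiment_json", "Classify the sentiment of: 'Great service!'", has_sentiment_key),
]


# Stub models simulating behavior before and after a provider update.
def model_before_update(prompt):
    return '{"sentiment": "positive"}'


def model_after_update(prompt):
    return "Sentiment: positive"


before = revalidate(model_before_update, cases)
after = revalidate(model_after_update, cases)
```

Because the checks test structure rather than exact wording, they tolerate harmless variation while still flagging changes that would break downstream components.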

Despite obstacles, freelancers note that the emergence of ro- 
bust foundational models has revolutionized their approach to de- 
livering intelligent features. Previously, access to large, high-qual- 
ity data sets, particularly in smaller projects, posed a significant 
hurdle in producing satisfactory products like free text chatbots. 
The introduction of LLMs has alleviated the strain associated with 
training and testing data, a key issue in previous AI generations. 


4.2 GenAI Hype as Source of Challenges 


Novelty of GenAI

The novelty of GenAI adds further challenges. Novelty amplifies
and reinforces challenges related to clients' unrealistic
expectations and limited knowledge. Lacking experience and
reference projects, clients do not have a realistic grasp of
GenAI's capabilities and do not know what developers can or cannot
do. This imposes extra effort on freelancers in the project's
initial phase: instead of focusing on work, they must spend time
explaining and demonstrating GenAI's limitations, which demands
more communication with the clients and new explanation skills. A
freelancer notes that clients sometimes question whether the
problems depend on the technology or the freelancer.


In the end, this is a source of issues for me because they [clients] don't 
know what to expect. It seems like it [GPT] can do pretty much anything. 
[And if problems occur] they're still not sure: maybe it can do it, but I 
[the developer] am the one that cannot do it, or maybe it really is a tech- 
nical limitation. It's very hard to communicate (...) what are the limita- 
tions. And moreover, why there are those limitations? (...) I try my best 
to explain it. And, so, I basically oversimplify. (ia-28-5) 


Many interviewees cite misinformation in public or social me- 
dia as a problem. Clients often develop unrealistic expectations 
from such misleading information about what GenAI can achieve 
and how quickly. A developer, who holds a PhD in computer
science and has extensive experience with ML, explains: 


I got a man who watched a video on YouTube, came to me, and said ‘We 
need exactly this’. I said ‘The thing you are referring is like a work of 
dozens of people for 3 to 6 months. And you want me to build this solu- 
tion within seven days. So, this is not possible.’ But he was insisting me 
to: ‘If they can do this, then you must do too’. (ml-30-5)


Another issue is the absence of a supportive community. While 
many freelancers appreciate the wealth of resources like courses, 
videos, and blogs, some struggle to locate the right community. 
As GenAI becomes more popular and use cases solidify, this is
expected to shift. This situation puts additional pressure on soft- 
ware engineering practices. The absence of a shared knowledge 
base and reference communities can hinder software develop- 
ment. A freelancer and start-up founder, who creates GenAI- 
based gaming applications, makes the following comparison: 


For example, if you have any trouble with the Unity [a game engine 
platform], I'm sure you can point to the Unity Forum, and someone asked 
these questions. (...) I'm sure there are many little communities [around 
GenAI]. (...) There are a lot of people [engaged in them], but I'm not sure 
how many people worked in Unity integration with the generative AI 
and how can we find each other? I don't know. (tx-30-6) 


The unique nature of GenAI amplifies the challenge of unpre- 
dictable results, necessitating a trial-and-error method. With 
scarce experience, intuition, and lack of reference communities, 
freelancers struggle to assess the feasibility of a client's vision. They
experiment, boosting their initial, unpaid workload and stress. 
Given that clients choose freelancers based on early interactions, 
freelancers might end up working extended hours for no gain. 

Freelancers concur that certain challenges may diminish over
time. First, they anticipate accumulating knowledge that will
simplify choosing appropriate clients and projects, communicating
with them confidently, and establishing connections with reference
communities. They believe that emerging metrics and quality
assurance measures will validate their work's robustness. Second,
they trust that potential clients will better understand their
expectations and how to articulate their projects. Lastly, they
predict that new tools, processes, orchestration frameworks, and
additional resources will be introduced to aid in the development
of applications and systems with GenAI components.


Fashionableness of GenAI

The fashionable character of GenAI developments presents
significant challenges. Older GenAI approaches like GANs, while
available for around 10 years [24], have remained niche, used for
specific cases. However, the arrival of GenAI relying on large,
pre-trained models that accept natural input has sparked new
interest. Many companies are eager to utilize its seemingly unlimited
capabilities hastily, often overlooking cheaper or simpler alterna- 
tives and not adequately examining their business needs. They are 
uncertain about their desired outcomes, making it challenging for 
them to define the metrics for evaluations. They also mistakenly 
assume that domain-dependent applications can be constructed 
without domain-specific data, expecting that LLMs possess access 
to all conceivable knowledge. An interviewee from a freelance 
agency emphasized that the hype exacerbates the problem of in- 
sufficient and unrealistic expectations, increasing pressure on pro- 
cesses associated with software requirements engineering and 
customer communication: 


I think the biggest challenge from a non-technical perspective is under- 
standing the use case. Many times, people would rush into [GenAI], 
‘Yeah, I want to chat about something. I want a generative AI solution.’,
but their problem can be solved with a much simpler solution. And, I 
think, it's really about understanding the problem before jumping into 
the solution instead of starting with the solution and trying to find the 
problem that fits for it. (ad-30-10) 


However, some clients limit access to end-users and other
resources necessary for requirements engineering, driven by the
intention to obtain their solution in the shortest time possible.
This has implications for subsequent phases of the SE process and
might result in dissatisfied clients and, consequently, a low
ranking for the freelancer. Therefore, some freelancers try to
avoid potential clients who appear too hype-driven.

The surge in interest has additional implications. The rise in 
requests puts freelancers in a position of choice. This demand and 
accessibility of GenAI capabilities also entices less-experienced 
developers to enter the market. On the one hand, it increases the 
price competition as new freelancers try to attract clients through 
low prices. On the other hand, new freelance developers fre- 
quently lack the necessary skills to produce high-quality SW, but 
the high demand ensures that they will be booked for projects.
Some freelancers take on projects previously developed by others. 
An interviewee explains how he had to navigate these issues: 


The work done on LangChain and OpenAI on Upwork is by developers 
that are not as experienced. Oftentimes clients end up with spaghetti code 
or some issue that they struggle with. (...) They got one file of Python, 
maybe a thousand lines of code, something like this. And I sat down, and 
I refactored it, and then I split into 6 or 7 files. (jf-23-5) 


This observation pertains to anti-patterns previously identified 
in standard SE and non-generative ML. Yet, the hype tends to 
draw freelancers who perpetuate these anti-patterns. These be- 
come problematic when there is a need to expand the application 
or assume responsibility for its maintenance and updates. 

The ecosystem around generative models is incredibly complex 
and dynamic, with new solutions constantly emerging. This
increases pressure on freelancers, who often find their newly
created solutions rendered obsolete by new off-the-shelf tools,
frameworks, or APIs. Moreover, they struggle to keep pace with the
frequent technological changes. While upskilling is crucial for
success, the unprecedented rate of change demands daily effort to
stay updated and flexibility to adopt new solutions quickly. These
dynamics underscore the importance of sensemaking and staying
updated. Even seasoned freelancers perceive this rate of change as 
something distinct to the buzz around GenAI. An experienced de- 
veloper illustrates how this influences his daily routine: 

We're right on the edge of trying to keep up. And especially if you're 
looking at multiple solutions like I do, not just OpenAI, but Google and
then, IBM, - it's a lot. There's so much coming out. And then, the cloud
vendors and supporting vendors, they're releasing like crazy. And then,
all of their AI-integrated products are hitting the market almost every
day. (...). So, I'm still trying to catch up on their press release. And then 
boom, here comes AWS. You know, it's easily a couple of hours a day just 
trying to stay current with everybody. (...) In the early morning, I go 
through my LinkedIn connections, and then go through my emails, and 
my calendar. And then around noon I'll repeat that. And then just before 
dinner. And then, sometime between 10 and 11 at night. (cl-55-15) 


The rising GenAI trend increases pressure on freelancers,
leading to issues in requirements engineering, professional practice,
and SE management. Freelancers must skillfully manage their cli- 
ents and invest significant effort to match technology with client 
needs. Even experienced freelancers find this hype novel and un- 
precedented as it magnifies older challenges and yields new ones. 


4.3 Summary 


Freelancers face many struggles when developing GenAI-based 
applications. Many appear new to them. Others seem aggravated
compared to ML or their previous experience. However, freelancers
also indicate that some old challenges have become less important.


Dolata et al. 


Based on the opinions collected from the freelancers and the 
literature covering SE4AI / SE4ML challenges, we compiled Table 
2 to summarize our results and compare them against earlier stud- 
ies. Literature refers to general challenges of developers while our 
results present freelancers’ standpoints. Specifically, columns 
three and four in the table reflect statements from our data and 
our interpretation thereof. We use the last column in the table to 
indicate factors with primary influence on the challenges sub- 
sumed under a specific category. For instance, freelancers indi- 
cated fashionableness and novelty as major factors amplifying and 
aggravating challenges related to customer expectations, which we 
indicate with ^. The star, *, points to categories for which
freelancers identified unique or new challenges. 7 indicates which
factors contribute to upholding challenges regarding GenAI in the
category. For instance, novelty of GenAI results in clients’ lack of
knowledge about its capabilities. While freelancers mention this
aspect as important, from their perspective similar problems were
occurring earlier with ML. » informs when a factor lessened the
impact of challenges previously reported for AI/ML. This pertains
especially to novel technological characteristics of LLMs which
resolve problems of earlier ML-based solutions.


Table 2. Comparison between challenges and needs reported in SE4AI literature and challenges and needs expressed by 
freelancers in our study along with indication of sources sorted by SWEBOK KAs and identified subcategories. See supple- 
mentary material [17] for the full version of this table with further distinction between singular challenges’ impact factors. 


KA | Subcategory Caen ie Nescis in SESA / SEAML Liieraiure Challenges and Needs Mentioned by Freelancers (FL) Impact 
Customers’ |° unrealistic 100% accuracy expectations, * clients have too high expectations towards abilities of Al because of | , fashion 
Expectations |° unrealistic expectations toward adoption (e.g., run- | the hype around GenAl, è clients demand use of GenAl despite mis- | , novelty 

g P ning system with too little data) match between the business requirements and GenAl’s capabilities 

a Customers’ e clients use demanding/unrealistic projects to learn about limita- 

$ Limited e lack of literacy concerning the capabilities of Al, tions of GenAl generating risks for FL’s rating, e new non-technical cli- | 7 fashion 

= Knowledge e lack of knowledge on quantitative metrics ents request GenAl-based solutions due to inflated expectations 7 novelty 

5 about GenAl’s capabilities 

E e clients and FL lack effective statistical measures for quality assess- 

5 e statistical metrics do not match requirements, ment of generated content (as opposed, e.g., to precision and recall 

> Metrics vs e statistical metrics do not match business metrics, measures for classification tasks), © assessment of generated con- * paradigm 

E Reauirements | ° difficult to use requirements coverage method, tent requires domain expertise, ¢ reliable ground truth is very difficult / | ^ fashion 

n q e lack of coverage-oriented datasets, impossible to create so evaluation cannot be reliably automatized, ^ novelty 

e no operationalization of coverage for ML e clients lack adequate business or quality criteria for new tasks if 
they haven’t been previously conducted by humans 
A art ales between all parts of ML- e orchestration was very difficult after release of some LLMs due to 
Components a : ; reves fe their novelty but lessened with new orchestration frameworks, * paradigm 
Orchestration oe. complexity due to distributed architec e downstream compatibility is hard to achieve and maintain due to ^ novelty 

= e hard-to-manage interactions between ML models, hondeterminism of GenAl's output forrat 

2 e singular LLM’s capabilities (e.g., generating answer vs. generating 

& | Inherent e changing anything changes everything, summary) are impossible to separate technically because they rely 

® | Complexity of | ° paradigm shift to pipeline-driven / system-wide view, | on a single model, ¢ similar or seemingly identical prompts can trig- Jearadiani 

S | ML plexity e entanglement created by ML models, ger different capabilities based on provision of different context data 8 

z e abstraction boundary erosion or subtle differences between prompts, * outputs of GenAl cannot be 

E meaningfully explained or interpreted (black box) 

Yn - x : : 

* new anti-patterns due to complexity of ML including * anti-patterns including glue code and correction cascades emerge > fashion 
(Anti-)Pat- glue code, pipeline jungles, correction cascades due to nondeterminism of the LLMs’ output’s format, * good patterns | , paradigm 
ors dead experimental code paths technical debt. are lacking due to new characteristics of LLMs, * hype attracts free- |; product 

e lack of good patterns ea who reinforce anti-patterns due to lack of development expe- | , novelty 
Data e dependency on quality and availability of data during | , dependency on quality and availability of training data lessened due 

c | Dependency | taining, e continuous validation necessary, * poor o availability of pre-trained models S product 

i) P Y | calibration of data and models P 

5 

s Domain * reusing models across domains and contexts diffi- | * problems related to transfer, generalizability, and reuse of models |; aradigi 

5 Dépendenc cult, ° transferring systems, ¢ inability of models to lessened due to generality and domain-independency of the availa- |; Bodice 

S p y generalize beyond training set ble LLMs 

S nsufficient e lack of understanding of models, libraries, and tech- | * shared knowledge base and reference communities to obtain this + novelty 

5 | Knowledge niques, Ħ lack of insight on problem at hand nowledge are missing for some application scenarios due to novelty 

é : z ‘ e bugs are present in database platforms (pinecone), orchestration 

B megaron | peene otincompatible poeramming languages, |f ameworkS, APIS due to the novel, = remewerksor ASUMCEEO | + product 

teroperability roblems J ý requent, unannounced updates to turning developed pipelines inef- 
P P ective, ¢ context window and prompt length are limited in the 
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currently available LLMs and APIs, « high latency times slow down 
he development and reduce user experience due to high load on ex- 
ernally hosted LLMs 


e problems in evaluating and debugging models, 


e trial-and-error remains necessary but relates to analyzing existing 
models through prompting rather than training new models, 


plans and pricing strategies of external providers 


Trialand Error | ° training models requires many iterations, Ħ analyzing | * GenAl’s reliance on billions of parameters makes trial-and-error for | 7 paradigm 
and understanding structure and behavior of ML raining and fine-tuning more costly and time-consuming, * need for | 7 product 
models, rial-and-error increased because models more complex and un- 

ransparent concerning, e.g., data used for training 
e hallucinations occur in LLMs’ output, ° output’s format and content 
Unreliable e difficult control of quality during development of ap- | are inconsistent, e output seems random and unpredictable, * out- ; 
Output plications and models, * analyzing and understanding | put has low quality due to reasoning mistakes, * sources used by the |^ paradigm 
he structure and behavior of the model model for answer generation are unclear and lack references, ° de- 
bugging is more difficult due to unreliable output 
e difficult to troubleshoot propagated data errors, + in- e inconsistency and variety of results from models make experiments 
a ada tit $ Pa more complex and require more experiments, * models’ capabilities A 
Experiment herent variability behind parameters, ¢ difficultto _ can change daily (e.g., models might ‘unlearn’ some capabilities) due 7 paradigm 
Complexity eocaus due to long experiments and complex in- to constant training / learning, * tracking experiment results is more Tproduct 
difficult and not supported by APIs or frameworks 
e inherent nondeterminism of most GenAl paradigms like LLMs, 
e hard to reproduce bugs and results due to non-de- paced, be only partially ee (e.8., tenperatur) but od ; 

Reproducibil- | terminism of results, e models’ behaviors cannot be tutned Off Makes test results nara tO reproduce: s missing fepraquel: $ 

‘0 ity completely predicted making validation and verifica- bility enhances where user generated or dynamically collected data |^ paradigm 
= ion difficut is used as context for a prompt due to potential interaction between 
8 lon aeut context and prompt, Ħ LLMs exhibit unstable performance with the 
= same task 
S e lack of specification to test against, * discovery and e quantitative measures are lacking for generated content due to out- 
È Testing definition of adversarial examples, > real-world testing | PUt’s complex character, e content and form of output is less predict- | , aradigm 
® | Criteria difficult due to safety or Securty 3 E | able than conventional ML due to generative nature, * opaqueness of R 
y models makes it difficult to specify adequate criteria for validation 

e identification of test oracles, * missing access to e manual EA a Se ama Sno feasible ere oes 

Testing Data | high-quality test data, e manual labor necessary for jects smat and. atthe Poor o concept Stage, e testing data and test | 5 fashion 
creation of test data, e missing test cases cases is missing due to customers’ ad-hoc interest and innovative 

i character of some projects 

Testing Tools e lack of test environments with trained ML models, * specific tools for testing of GenAl technologies are lacking due to 

and Support e lack of cross-framework and cross-platform sup- the novelty of the paradigm, * new testing processes are necessary |7 novelty 
port, ¢ lack of tools for testing for GenAl which are yet to come due to novelty of the paradigm 

External e hard to understand the effects of external ML algo- |À pe i cca appicangnis. ech rE on external, nE meee 

© | Dependen- rithms on desired qualities at runtime, ° need to deal Surpa: MOASS IS CINCU QUS 10- AC SOL CONG OVET Core IUNCUONnA INYA 2 paradigm 
= | cies with not-assured components models, e quality assurance is difficult due to black box, untranspar- | 7 product 
5 ent nature of available LLMs 
A e effective and persistent guardrails for GenAl to assure the quality 
S Quali e lack of criteria for the new types of quality features, eatures are hard to specify due to variable output, ° desired quality 
È St ty e hard to define quality standards, ° scalability of of output and possible limitations imposed by guardrails are hard to : 
andards ; TN PAnR i p 7a paradigm 
8 | and Criteria quality standards, e lack of certification, qualification, | balance due to untransparent relation between input and output, 
and standards on code quality e sufficient and adequate quality criteria for GenAl are missing (e.g., 
airness metrics for LLMs’ output) 
wage e difficult to predict or limit what input or context will be provided to 
g Inadequate ri cons sla le sae neh he system by the user (if application allows for free user input, which 
c | Context of changin el ofise e inade Cee iiserinter: most do) due to LLMs acceptability of unstructured input, e LLMs can | * product 
© | Use ied aloes ofML systems _ q handle unexpected input, but applications might inadequately pro- 
£ P y cess LLMs’ output based on such input 
Revalidation of Updated Models
  Literature: hard to determine frequency of training due to changes in context; need to revalidate after updates; updates in ML consider code, model, data as opposed to code in legacy systems; changes in model behavior reduce users’ trust and downstream/backward compatibility.
  Freelancers: revalidation is hard due to frequent changes in the models delivered by external parties (e.g., OpenAI); downstream compatibility is difficult to assure due to changes in the models and external dependencies; applications’ usability or usefulness is difficult to guarantee over time as no information is available on plans of the providers; new requirements for validation arise since validation is necessary of the applications and prompts used therein and not the models themselves.
  Impact factors: * product

Economic and Computing Resources
  Literature: timing, memory, and energy constraints; costs of training; long training times for iterations; need for adequate computing power.
  Freelancers: new cost structure is necessary due to pay-per-use policy of major providers, including costs of exploring external models’ potentials and costs of using external models in the applications; resources necessary for training of own GenAI models are incomparable to conventional ML due to models’ complexity and size.
  Impact factors: * product
© 
g 
a even though external dependencies are considered |5 S Sieations rly on external models, access to APIs and providers 
8 Third-Party ee ee iene infrastructure, Ħ the usability and popularity of FL’s applications de- * product 
Dependency s : : pends on speed, availability, and compatibility of external compo- 
2 a idles for software configuration man- nents, * planning of economic resources is difficult due unclear 
8 52 Engineer- | e need for continuous engineering, * engineering is ad-hoc and results from short-term demand driven, | , fashion 
a2 ing e necessity for ad-hoc development e.g., by hype around GenAl 
wot 
n= 
Customer e hard to negotiate unfeasible expectations, *hardto | * common ground with clients from outside IT context is difficult due 
Communica: convince clients to pay continuously for improve- to hype and inflated expectations around GenAl, e communication alfashion 
tion ments, Ħ need for skills to help customers set feasible | with clients is difficult because clients rely on misinformation and in- 
targets / specify requirements flated promises regarding GenAl 
: A 4 : e clients without experience in IT projects refuse to participate in re- 
End-user A pente cus om technica iSo mtions ratier tianirm-. quirements elicitation or provide requirements which hare not useful, | , fashion 
g | Engagement Een earl enough e hype-driven clients expect fast results neglecting the need to in- 
3 y volve end users in design 
o ape x 3 
a e difficult to explain to clients why models produce t peitormanceo i we as O ere So, Fa 
£ Explanations madel ipese aE or WE ness and inherent variability of the models make it difficultto under- |^ paradigm 
3 Ment RHM ime p P p stand output and make reliable statements, ¢ clients blame freelanc- 
o ers for low performance of the models 
g 
Workforce Skills
  Literature: need for diverse skills; lack of diverse skills in teams; lack of adequately educated engineers.
  Freelancers: skilled freelancers are lacking due to increased hype-driven demand for their skills and experience; freelancers without adequate experience join the market due to enhanced demand and potential for lucrative jobs; staying on top of dynamic market changes and technical developments causes major effort to keep pace with it.
  Impact factors: * fashion

Community and Competition
  Literature: the analyzed literature does not explicitly mention community-related topics as challenges.
  Freelancers: community to exchange with is difficult to find, especially on use of LLMs and GenAI for specific domains, due to novelty and niche character of some projects; competition increases due to many new freelancers entering the market.
  Impact factors: x fashion, x novelty
: ; ° : e estimations are difficult to make due to lack of experience, ¢ limited 
T ° uncertainty i estimating development time and _ | ability to say upfront if a specific case can be covered by GenAl tech- 
cost, * no well-defined input-output factor for valida ^ novelty 
Planning tion. ®@ aseessnent of long teri botentialbased on nologies due to lack of experience, ° clients expect that a certain task atashion 
Uncertainty shortferm metrics- ei E seibily to make suaran? will be solved with GenAl despite more adequate and easier-to-plan | , paradigm 
A tebs on RAS A 8 alternatives, e.g., conventional ML, ° reasonable offer / putting a price 
S tag on a project difficult to due to lack of heuristics and experience 
E | Resources na : e costs related to experimentation during prompt-engineering are 
So Shortage and ieee al Speeds sneer lack of organ- hard to manage and control especially if the client does not cover 7 product 
S Costs those costs 
Pre-Selection Effort
  Literature: the analyzed literature does not explicitly mention challenges related to pre-selection of projects.
  Freelancers: preselecting projects is hard based on the requests from clients due to lack of experience; fashionableness causes many unusual or incomplete requests and assessing them requires effort to explore whether GenAI technologies are useful to solve the requests; frequent changes and instability of the models might invalidate pre-selection decisions.
  Impact factors: * fashion, * product, * novelty
e hard to discover what data exists and where, 
S Data Access |, difficulty to collect new data and prepare it 
© 
a Data * hard to control, version, and deploy data e no need for training data and training data quality assurance due to 
£ | Management the availability of pre-trained models 
w . 
Q * need to combine, transfer, clean, and transform g , , , , 
€ | Data data, FL refer to challenges related to data if their projects involve (a) train- | $ product 
2 Preparation |e lack of tools for data preparation, ing of fine-tuning models or (b) retrieval augmented generation; in 7 paradigm 
© * inconsistent data types and quality both cases the challenges confirm many of the challenges listed for 
po — - conventional ML, larger datasets are necessary for training a genera- 
Z e Sones aoe qara quality, ae tive model. 
© ; e missing values and rare cases in data, 
= Data Quality |, ack of tools to annotate and assure data quality, 
e bias in data 


5 DISCUSSION 


Freelancers report a multitude of challenges they experience
when developing applications that include GenAI components. They
are surprised by the magnitude of some challenges, e.g., issues
related to staying on top of changes, while others appear completely
new to them, e.g., the impracticality of separating functionalities.
Given the increasing role of freelance developers in digital
transformation [25-29, 39], their insights are invaluable to the broader
community. Simultaneously, the results provide information about
freelancers and their work, such as the importance of adequate
project selection. In the following, we interpret our findings and
suggest avenues of research for SE4GenAI and for HypeSE.
5.1 Software Engineering for Generative AI


What does GenAI imply for freelancers? Freelancers suggest that 
GenAI equips them to develop potent applications within a con- 
strained timeframe. Rather than engaging in expensive and 
lengthy processes of data collection, preprocessing, and training, 
they delve into prompts that can be efficiently handled by an indi- 
vidual. Freelancers from non-technical backgrounds express that 
GenAI has unlocked new avenues for them, fostering opportuni- 
ties for economic and professional growth. They leverage freelance 
jobs to enhance their skills and render their freelance career more 
sustainable. However, they require practical guidance, illustrative 
examples, and consistent interaction to achieve this. 


How do freelancers explore Generative AI?


The freelancers would have appreciated more structured guidance
on GenAI technology, both as a concept and as a product. They
reported unexpected issues stemming from hallucinations,
challenges in handling context data during prompting, and
complications associated with prompt engineering. Their search
across the Internet for solutions to and explanations of these
phenomena was often messy, leading to a disorganized process and
outcome. Our own research indicates that comprehensive, professional
guidance on potential challenges is in its infancy and fragmented
across platforms and communities [30, 34-36, 60, 64]. Freelancers’
ability to improvise becomes a core asset in this situation. The short
trial-and-error cycles typical of working with GenAI turn
improvisation into the actual value offering of freelancers. This,
however, does not come without challenges for SE.

Freelancers are particularly affected by challenges related to
characteristics of the GenAI paradigm during SW design, construction,
testing, and quality assurance due to hallucinations, low
reproducibility, reasoning errors, inherently probabilistic output, and
the CACE principle (changing anything changes everything). Also,
insufficient explainability leads to issues in contact with
customers, affecting freelancers’ professional practice.
Whereas the technical challenges are known [30, 34], our results
document their impact on the day-to-day SE practice of freelancers.
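
To make the reproducibility problem tangible, the following minimal sketch estimates how often repeated calls with the same prompt agree. It is an illustration under stated assumptions and not a procedure reported by the interviewed freelancers: `generate` is a hypothetical stand-in for whatever model call a project uses, and exact matching after trimming and lowercasing is only one simple way to define agreement.

```python
from collections import Counter
from typing import Callable


def agreement_rate(generate: Callable[[str], str], prompt: str, runs: int = 10) -> float:
    """Return the share of runs that produce the most frequent output.

    `generate` is a hypothetical wrapper around a model call (assumption).
    Outputs are compared after trimming whitespace and lowercasing.
    """
    if runs < 1:
        raise ValueError("runs must be at least 1")
    # Call the model repeatedly with the identical prompt.
    outputs = [generate(prompt).strip().lower() for _ in range(runs)]
    # Count how often the single most common normalized output occurs.
    most_common_count = Counter(outputs).most_common(1)[0][1]
    return most_common_count / runs
```

A value of 1.0 means every run returned the same normalized text, while lower values indicate variable output. For free-form text, a looser similarity measure would likely be needed in place of exact matching.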

The characteristics of GenAI products influence SW construction,
SW maintenance, SW configuration management, and SE management.
Contributing factors include the constant
changes and opaque update cycles of leading providers like
OpenAI, the availability and latency of models (including not just
inference latency, but also latencies from workload distribution),
cost structures and policies enforced by providers, and the minimal
control over model behavior. Currently, freelancers do not see
any significant alternatives to these major providers due to the
restricted capabilities of open-source models or the prohibitive costs
of fine-tuning and training. This indicates that the question of
whether GPT is all one needs [64] will remain valid for the years
to come. The dependency on external providers and their products
was a marginal issue in earlier SE4AI literature [23, 42]. Yet,
because training and hosting one’s own GenAI models is frequently
impracticable, the issue of managing external dependencies gains
importance.
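
The pay-per-use cost structure mentioned above can be illustrated with a small worked calculation. The sketch below separates exploration costs (prompt experimentation) from production costs. All names and prices are hypothetical placeholders chosen for illustration, not figures from any provider or from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Prices per 1,000 tokens in USD (placeholder values, not real quotes)."""
    input_per_1k: float
    output_per_1k: float


def call_cost(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Cost of a single model call under token-based pay-per-use pricing."""
    return ((input_tokens / 1000) * pricing.input_per_1k
            + (output_tokens / 1000) * pricing.output_per_1k)


def project_cost(pricing: Pricing, input_tokens: int, output_tokens: int,
                 exploration_calls: int, production_calls: int) -> dict:
    """Split estimated cost into exploration and production shares.

    Assumes every call has the same average token counts (simplification).
    """
    per_call = call_cost(pricing, input_tokens, output_tokens)
    exploration = per_call * exploration_calls
    production = per_call * production_calls
    return {"exploration": exploration,
            "production": production,
            "total": exploration + production}
```

With placeholder prices of 1.0 and 2.0 USD per 1,000 input and output tokens, a call with 1,000 input and 500 output tokens costs 2.0 USD, so 10 exploration calls add 20.0 USD before any production traffic. Keeping the exploration share visible matters when the client does not cover experimentation costs.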

Based on freelancers’ reports, we propose a new area of
research, Software Engineering for Generative AI (SE4GenAI). We
propose the following key components that SE4GenAI must
incorporate: (1) systematic guidance for activities like managing the
trial-and-error loop as the primary mode of making progress
during SW construction; (2) strategies and metrics for reliable
testing, quality assurance, and downstream compatibility to handle
GenAI’s variable outputs and dynamic capability changes,
including ways to make adequate guarantees concerning functionality;
(3) guidance on dealing with external dependencies impacting
functional and non-functional aspects of the system during
maintenance as well as the cost structure of the product; (4)
support for more effective explanation practices in communication
with the clients about (at least currently) unexplainable and
unpredictable model behavior. Equipped with that knowledge,
freelancers in our study would have had a better chance to complete
their products more efficiently.
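
Component (2) can be made concrete with a small sketch of one possible guard for downstream compatibility: validating the format of a model's output and retrying a bounded number of times. This is an assumption-laden illustration rather than a method from the study. `generate` is a hypothetical stand-in for a model call, and the JSON-with-required-keys contract is an example format chosen here.

```python
import json
from typing import Callable, Iterable, Optional


def generate_validated(generate: Callable[[str], str], prompt: str,
                       required_keys: Iterable[str],
                       max_attempts: int = 3) -> Optional[dict]:
    """Call the model until its output parses as a JSON object with the
    required keys, or until the attempt budget is used up.

    Returns the parsed object, or None if no attempt passed validation.
    `generate` is a hypothetical wrapper around a model call (assumption).
    """
    required = set(required_keys)
    for _ in range(max_attempts):
        raw = generate(prompt)
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # Output was not valid JSON; try again.
        # Accept only JSON objects that contain every required key.
        if isinstance(parsed, dict) and required.issubset(parsed.keys()):
            return parsed
    return None
```

Bounding the number of attempts keeps the trial-and-error loop from running up pay-per-use costs without limit. Returning `None` lets the calling code decide how to handle a persistent format failure.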


5.2 Hype-Induced Software Engineering 


What does hype mean for freelancers? The escalating excitement
around GenAI has significantly influenced the dynamics of the
freelance market. Our data clearly indicate that the success of
freelancers is heavily dependent on job platforms and their
respective communities. To maintain their success, freelancers must
master the art of navigating these platforms, which may sometimes
involve declining projects of personal interest if the likelihood
of success seems uncertain. In this era of heightened interest,
the key strengths of freelancers are their autonomy, agility, and
flexibility. Those who work on short, time-bound projects, in
particular, can swiftly adapt to new technologies and capitalize
on the current buzz. However, this also means that freelancers
must be able to acquire new skills quickly and independently, a
process that demands both time and effort.

The integration of GenAI and the escalating demand for
developers have presented new challenges that have affected freelancers,
their work routines, and even their personal lives. We hypothesize
that the rapid advancement of technology will likely generate
similar trends in the future, and the impact of these trends on the IT
development sector will extend beyond freelancing. This could
potentially disrupt many established SE practices and guidelines. We
argue that the strategies employed by experienced freelancers to
navigate the hype provide valuable insights for SE research.
Concurrently, formalizing this knowledge about navigating trends
could be beneficial for new freelancers seeking guidance.

The fashionableness of GenAI is primarily reflected in freelancers’
comments concerning new or amplified challenges in SW
requirements, SE professional practice, and SE management. The hype
incites inexperienced clients and inexperienced, low-skilled
freelancers to enter the market. Clients come with exaggerated,
misinformed expectations and unclear or hard-to-measure targets.
Frequently, they are simply interested in solution-based probing [9]
to discover new opportunities. Inexperienced freelancers, even if
trained as ML specialists or software developers, lack tools and
skills to interact with such clients in a professional way.

Freelancers are now operating in a highly competitive environment
that necessitates careful project selection and intense
pre-contract solution testing. This challenge, which has not been
addressed in previous SE4AI research [42] or SE hype literature [19],
is substantial. We hypothesize that this will lead to a reciprocal
effect on the client side, particularly among inexperienced
businesses and startups. These entities must undertake a complex
search for suitable talent, prompting freelancers to engage in
resume-driven development to establish credibility [19, 20]. The
dynamics of GenAI hype differ from previous trends such as
blockchain. For instance, the development of smart contracts
necessitated extensive client-engaged requirements engineering [14, 49],
which made the freelancers’ work transparent and accountable,
compelling the client to contemplate their objectives. Moreover,
blockchain's use cases were primarily confined to finance and
value capture [12, 14, 41, 43, 49], whereas GenAI encompasses a
wider range of applications. This compels freelancers to rapidly
acquire the requisite domain expertise.

The novelty of GenAI further amplifies the effects created by
GenAI’s fashionable character concerning SW requirements, SE
management, and SE professional practice. It increases planning
uncertainty due to the lack of heuristics, reference projects, and
reference communities, which makes the overall situation harder. The
novelty encourages ad-hoc efforts and quick fixes, anti-patterns in
SW design, and inefficient communication in SE practice.

Considering the likelihood of future hypes, we advocate for
research in Hype-Induced Software Engineering (HypeSE). Freelance
and independent developers, due to their flexibility and agility,
are most impacted by hypes and require support. HypeSE
should incorporate: (1) guidance on managing inflated expectations
and communicating these to clients and users, extending
existing literature on requirements elicitation and management; (2)
processes and tools for prioritizing and managing project
opportunities to support economic stability and mitigate risks associated
with riding high waves; (3) support to help developers manage
incoming hype information, including fostering effective community
building and facilitating expert information exchange; (4)
definition of new roles such as domain-specialist freelancers who
provide in-depth insight and consultancy to clients based on expertise
collected in similar projects, assuring diffusion of knowledge
between client companies and enabling them to spot opportunities
for clients. To accommodate such new roles, we argue that CS and
SE education should focus on equipping future developers with
specific domain skills rather than producing domain-agnostic
generalists. This could foster directions like SE for the financial
industry, life sciences, or social media.


5.3 Generalizability and Threats to Validity 


While this study highlights the need for SE4GenAI and HypeSE 
based on freelancer experiences, we propose that other stake- 
holder groups and professions may face similar, related, or com- 
plementary challenges. Further research is required to compre- 
hend these challenges, emphasizing the unique aspects of each 
stakeholder group. For example, issues related to imprecise or ex- 
aggerated software requirements may be interpreted differently by 
clients, contractors, and requirements engineers. Addressing these 
problems necessitates understanding the mechanisms that pro- 
duce these challenges for each group. This viewpoint would sup- 
plement the currently prevalent approach, which attempts to ag- 
gregate software engineering challenges and their solutions at the 
technology level, such as GenAI [18, 30, 34, 63, 64] or AI/ML [23, 
40, 42, 44, 47]. We recognize existing efforts to concentrate on de- 
velopers as a stakeholder group [59] and specific studies focusing 
on particular areas of SE, including requirements engineering or 
testing [32, 48]. We advocate for the amplification of these efforts 
by incorporating the perspectives and specific contexts of project 
managers, client companies, ML model providers, and so on. 

Interview studies risk low interpretation validity due to sugges- 
tive questioning, personal bias, and coder assumptions. We miti- 
gated this by allowing interviewees to begin with an open narra- 
tive, capturing extensive information without a specific focus. An 
uninvolved researcher coded the data to minimize bias. We main- 
tained a theory-agnostic approach to avoid guiding interviewees 
or analysis. However, we acknowledge that the first author's prior
involvement in AI-based studies may have influenced him.

This study's internal validity may be influenced by context and 
respondent biases, as it relies on reports potentially affected by re- 
cent events or second-hand reflections. The external validity could 
be impaired due to the nascent nature of GenAI, which is yet to 
stabilize. We mitigate this by distinguishing challenges linked to
GenAI's characteristics from those with different origins.
Additionally, we sought to improve triangulation by interviewing a
diverse group of developers.


6 CONCLUSION AND FUTURE WORK 


This study was conducted with the objective of pinpointing the 
challenges freelancers face in relation to GenAI. Freelancers, who 
serve as ideal collaborators for small to mid-sized enterprises 
exploring new technologies, provided insights into the challenges
they perceive as new or significantly heightened due to the hype
surrounding GenAI, as well as GenAI’s inherent and product-related
features. Freelancers have observed a shift in workload from
data management and model training to model interaction and 
output management, which complicates their ability to rely on 
previous intuition and experience. Moreover, the hype around 
GenAI contributes to problems arising from overblown expecta- 
tions and swift changes that demand the freelancer's attention. 
This situation parallels notable historical events such as the gold 
rush or the exploration of the wild west, where independent pio- 
neers ventured into uncharted territories. 

We believe the SE community should monitor future develop- 
ments in this field. The findings of this study could be reinforced 
through additional qualitative research or large-scale confirma- 
tory surveys. Ethnographic studies based on in-situ observation of 
larger development teams could offer insights into how freelancers 
tackle practical challenges and uncover unexplored issues. We also 
recommend normative and prescriptive research to develop new 
standards and techniques for software engineers and managers. 

The study provides practitioners with insights into the challenges
of SE for GenAI-based solutions. The provided list of SE
challenges is intended to sensitize them to specific issues that
may emerge. The study also educates practitioners about the impact
and risks of hype. Freelancers who depend on external
contracts are especially subject to those risks. SE researchers gain
an empirical basis for deriving guidance for practitioners and become more
aware of hype as an inherent element in SE. Its influence extends 
beyond the context of GenAI and directly affects SE practice. We 
suggest addressing these challenges through intensified research 
on SE4GenAI and HypeSE. 
Results / Discussion


Extended Version of Table 2 


The table below is an extended version of Table 2 from the manuscript. It includes subcategories which 
were identified in the literature but were either confirmed by freelancers without further deliberation (©) 
or no comments were made by freelancers on the given category of challenges (©). The star, x, points to 
categories for which freelancers identified unique or new challenges. 7 indicates which factors contribute
to upholding challenges regarding GenAI in the category. x indicates when a factor lessened the impact of
challenges previously reported for AI/ML. ^ indicates factors that significantly amplify challenges 
belonging to the given category. 


Software Requirements 


Expectations 


- unrealistic expectations toward adoption 


- clients demand use of GenAl despite mismatch between the 


Challenges and Needs in SE4AI / SEAML > Impact 
KA |Subcategory literature [1, 2, 4, 5] Challenges and Needs Mentioned by Freelancers (FL) Factors 
- clients have too high expectations towards abilities of Al mtashion 
i Tad ò : 
Customers unrealistic 100% accuracy expectations, because of the hype around GenAl 


(e.g., running system with too little data) ; l hha * novelty 
business requirements and GenAl’s capabilities 
- lack of literacy concerning the - clients use demanding/unrealistic projects to learn about 
Customers’ om limitati f GenAl ing risks for FL’ . 7 novelty 
a capabilities of Al, imitations of GenAl generating risks for FL’s rating 
Limited nee ‘ 
Knowledge - lack of knowledge on quantitative metrics |- new non-technical clients request GenAl-based solutions atashióñ 
to measure those capabilities due to inflated expectations about GenAl’s capabilities 
Data Limits - data limitations go against requirements, |FL do not refer to those and similar problems in relation to o 
Requirements |- extraction of features from data difficult |Software Requirements 
— . - clients and FL lack effective statistical measures for quality 
- statistical metrics do not match : 
: assessment of generated content (as opposed, e.g., to * paradigm 
requirements, aas NO 
a , , precision and recall measures for classification tasks) 
- statistical metrics do not match business 
Metrics vs. metrics, 
Requirements |- difficult to use requirements coverage - reliable ground truth is very difficult / impossible to create so 


method, 
- lack of coverage-oriented datasets, 


evaluation cannot be reliably automatized 


Software Design 


` AN - clients lack adequate business or quality criteria for new ^ fashion 
- no operationalization of coverage for ML . ; : 
tasks if they haven’t been previously conducted by humans * novelty 
- fragmented/incomplete knowledge of 
new or ML-typical non-functional 
New Types of i he FL confirm the presence of vague and hard-to-elicit 
: requirements, f e 
Requirements ; a requirements 
- requirements hard to elicit / vague / not 
apparen 
- difficult dependencies between all parts |- orchestration was very difficult after release of some LLMs 
of ML-based system, due to their novelty but lessened with new orchestration ^ novelty 


Components 


- additional complexity due to distributed 


frameworks 


Orchestration Jarchitecture, Pree i D 
- hard-to-manage interactions between - downstream compatibility is nard to achieve and maintain * paradigm 
various ML models, due to nondeterminism of GenAl’s output format 
- singular LLM’s capabilities (e.g., generating answer vs. 
generating summary) are impossible to separate technically |x paradigm 
- changing anything changes everything, because they rely on a single model 
Inh t - paradigm shift to pipeline-driven / 
oe A a 8 ‘ : pib - similar or seemingly identical prompts can trigger different 
Complexity of |system-wide view, se pia ; , 
capabilities based on provision of different context data or * paradigm 
ML - entanglement created by ML models, : 
, subtle differences between prompts 
- abstraction boundary erosion 
- outputs of GenAl cannot be meaningfully explained or x di 
interpreted (black-box) paradigm 
- new anti-patterns due to complexity of - anti-patterns including glue code and correction cascades a paradigm 
emerge due to nondeterminism of the LLMs’ output’s format |7 product 


(Anti-)Patterns 


ML including glue code, pipeline jungles, 
correction cascades, dead experimental 
code paths, technical debt, 

- lack of good patterns 


- hype attracts freelancers who reinforce anti-patterns due to 
lack of development experience 


a fashion 
i TAON - dependency on quality and availability of training data vrait
- dependency on quality and availability of lessened due to availability of pre-trained models P 
Data data during training, 
Dependency - continuous validation necessary, FL confirm such problems wrong format, inadequate data, too 
- poor calibration of data and models much or not enough data in context of retrieval augmented e 
generation 
- reusing models across domains and - problems related to transfer, generalizability, and reuse of s paradigm 
contexts difficult, models lessened due to generality and domain-independency P 
Domain i i of the available LLMs “produet 
Bepantieney - transferring systems across domains, 
- inability of models to generalize beyond |- some languages (Arabic) are insufficiently covered by LLMs 
training set which was also a problem in earlier NLP solutions 
- lack of understanding of models, o. r 
S A i : - shared knowledge base and reference communities to obtain 
Insufficient libraries, and techniques, i maak L : 
. ne this knowledge are missing for some application scenarios * novelty 
Knowledge - insufficient knowledge about the problem 
due to novelty 
at hand 
Ae - inflexible Al/ML development tools, r 7 
Insufficient > FL did not mention problems related to development tools 
- fragmented toolchains, © 
Tools . (such as IDEs). 
- lack of tool infrastructure 
c - bugs are present in database platforms (pinecone), a novelty 
2 orchestration frameworks, APIs due to their novelty 
[$] A $ $ 
f - presence of incompatible programming |. frameworks or APIs undergo frequent, unannounced updates Sproduë 
2 languages, to turning developed pipelines ineffective P 
8 Integration - bugs in libraries, frameworks, and 
v platforms - context window and prompt length are limited in the currently ^ produc 
$ - interoperability problems available LLMs and APIs 
5 - high latency times slow down the development and reduce Poradie 
m user experience due to high load on externally hosted LLMs P 
- trial-and-error remains necessary but relates to analyzing a paradigm 
existing models through prompting rather than training new a prodüs 
del 
- problems in evaluating and debugging eee 
models, - GenAl’s reliance on billions of parameters makes trial-and- 
Trialand Error |- training models requires many iterations, |€rror for training and fine-tuning more costly and time- 7 produc 
- analyzing and understanding structure 
and behavior of ML models, - need for trial-and-error increased because models more a paradigm 
complex and untransparent concerning, e.g., data used for 
ee ” product 
training 
- difficult control of quality during 
Unreliable development of applications and models, 
Output - analyzing and understanding the 
structure and behavior of the model - sources used by the model for answer generation are unclear 
and lack references 
- debugging is more difficult due to unreliable output * paradigm 
- inconsistency and variety of results from models make See adie 
- difficult to troubleshoot propagated data [experiments more complex and require more experiments P 8 
errors, 
E ; t : ; ee . > 
xperimen - inherent variability behind parameters, - models capabilities can change daily (e.g., models might ^ product 
Complexity en unlearn’ some capabilities) due to constant training / learning 
- difficult to trace results due to long 
experiments and complex interactions - tracking experiment results is more difficult and not nBrGduet 
supported by APIs or frameworks P 
oo 
= - inherent nondeterminism of most GenAl paradigms like 
o LLMs, which can be only partially reduced (e.g., temperature) |^ paradigm 
2 - hard to reproduce bugs and results due uP y (6-8 P ) P 8 
v a but never turned off, makes test results hard to reproduce 
= to non-determinism of results, 
3 |Reproducibility |- models’ behaviors cannot be completely |- Missing reproducibility enhances where user generated or 
S i i idati dynamically collected data is used as context for a prompt due|^ paradigm 
P predicted making validation and 
verification difficult to potential interaction between context and prompt 
- LLMs exhibit unstable performance with the same task * paradigm 
ae f - quantitative measures are lacking for generated content due Sinapadiaen 
- lack of specification to test against, to output’s complex character p g 
Testing Criteria |- discovery and definition of adversarial 
- content and form of output is less predictable than : 
examples, * paradigm 
conventional ML due to generative nature 
- real-world testing difficult due to safety or |- opaqueness of models makes it difficult to specify adequate


security criteria for validation * paradigm 
- identification of test oracles, - manual creation of testing data is not feasible because many E A 
- missing access to high-quality test data, |projects small and at the proof-of-concept stage 
Testing Data - manual labor necessary for creation of 
test data, - testing data and test cases is missing due to customers’ ad- rashin 
- missing test cases hoc interest and innovative character of some projects 

Testing Management
Literature:
- costs and resources for systematic tests,
- management of tests and their results,
- unnecessary repetition of tests,
- keeping track of tests
Freelancers:
FL do not see testing management as an issue since most of them do not engage in systematic testing.

Testing Standards
Literature:
- lack of methods for preparation of testing,
- major role of developers’ assumptions in simulation test
Freelancers:
FL confirm the lack of standards and processes to guide testing, but due to low importance of systematic tests in their projects, it is not a core issue.

Testing Tools and Support
Literature:
- lack of test environments with trained ML models,
- lack of cross-framework and cross-platform support,
- lack of tools for testing
Freelancers:
- specific tools for testing of GenAI technologies are lacking due to the novelty of the paradigm (novelty)
- new testing processes are necessary for GenAI which are yet to come due to novelty of the paradigm (novelty)

External Dependencies
Literature:
- hard to understand the effects of external ML algorithms on desired qualities at runtime,
- need to deal with not-assured components
Freelancers:
- quality assurance of applications, which rely on external, non-assured models, is difficult due to lack of control over core functionality models (paradigm, product)
- quality assurance is difficult due to black-box, untransparent nature of available LLMs (paradigm, product)

New Types of Quality Features (Ethics, Fairness, ...)
Literature:
- new, ML-specific regulatory and ethical consideration of ML models,
- lack of established methods to assure fairness, safety, and security of ML
Freelancers:
FL have concerns about the quality features but do not list any new and do not indicate that any of those types amplified due to GenAI.

Quality Standards and Criteria
Literature:
- lack of criteria for the new types of quality features,
- hard to define quality standards,
- scalability of quality standards,
- lack of certification and qualification, lack of standards on code quality for ML systems
Freelancers:
- effective and persistent guardrails for GenAI to assure the quality features are hard to specify due to variable output (paradigm)
- desired quality of output and possible limitations imposed by guardrails are hard to balance due to untransparent relation between input and output (paradigm)
- sufficient and adequate quality criteria for GenAI are missing (e.g., fairness metrics for LLMs’ output) (paradigm, novelty)

Quality Assurance Data
Literature:
- impossible to collect data which is necessary for assuring quality (e.g., individual demographics to assure fairness),
- lack of coarse level methods
Freelancers:
FL did not mention any specific challenges related to lack of data for quality assurance.

Adversarial Attacks
Literature:
- hard to prevent hacking or game rewards function,
- risks of extraction of model information by querying,
- risks of recovering training set
Freelancers:
FL did not mention purposeful malign behavior and adversarial attacks as major risk for their projects.

Detecting Model Risks
Literature:
- risk of feedback loops, concept drift, fluctuations in input data, or manipulation,
- large scope of tracking and controlling change,
- need to avoid damage to environment
Freelancers:
FL did not mention model risks as a significant challenge for themselves, but they see it as a challenge for the providers.

Context of Use
Literature:
- undeclared consumers who use system not in line with original purpose or inadequate intentions,
- ... to changing environment of ...
- inadequate user interfaces on top of ML systems
Freelancers:
- difficult to predict or limit what input or context will be provided to the system by the user (if application allows for free user input, which most do) due to LLMs’ acceptability of unstructured input (* product)
- LLMs can handle unexpected input, but applications might inadequately process LLMs’ output based on such input

Revalidation of Updated Models
Literature:
- hard to determine frequency of training due to changes in context,
- need to revalidate after updates,
- updates in ML consider code, model, data as opposed to code in legacy systems,
- changes in model behavior negatively impact users’ trust and downstream/backward compatibility
Freelancers:
- revalidation is hard due to frequent changes in the models delivered by external parties (e.g., OpenAI) (^ product)
- downstream compatibility is difficult to assure due to changes in the models and external dependencies
- applications’ usability or usefulness difficult to guarantee over time as no information available on plans of the providers (^ product)
- new requirements for validation since validation is necessary of the applications and prompts used therein and not the models themselves (* product)

Tracking Data
Literature:
- keeping track of datasets and their metadata
Freelancers:
FL did not refer to challenges related to tracking data.

Infrastructure
Literature:
- technical debts present in ML frameworks,
- need for reliable infrastructure to run the model
Freelancers:
FL mention need for infrastructure if they need to host data for retrieval augmented generation or if they want to host their own model (two cases in the data), however those problems are not considered new or amplified.


Economic and Computing Resources
Literature:
- timing, memory, and energy constraints,
- costs of training,
- long training times for iterations,
- need for adequate computing power
Freelancers:
- new cost structure is necessary due to pay-per-use policy of major providers including costs of exploring external models’ potentials and costs of using external models in the applications (* product)
- resources necessary for training of own GenAI models are incomparable to conv. ML due to models’ complexity and size (* paradigm)

Scalability
Literature:
- need to scale models at the end of production,
- need for scalable oversight for models
Freelancers:
FL refer to speed and workload issues of external models, indicating that this might be related to their scalability. This negatively impacts the scalability of FL’s own applications, but it is similar to earlier challenges.

Short-Lived Technology
Literature:
- AI-based system can be invalidated due to trend changes
Freelancers:
FL acknowledge that GenAI is subject to the hype, but do not expect that the trend will disappear quickly.

Third-Party Dependency
Literature:
even though external dependencies are considered for quality assurance or development, considered literature does not explicitly attend to implications of external dependencies for software configuration management
Freelancers:
- dependency on external providers is significantly higher because applications rely on external models, access to APIs, and providers’ infrastructure (product)
- the usability and popularity of FL’s applications depends on speed, availability, and compatibility of external components
- planning of economic resources is difficult due unclear plans and pricing strategies of external providers (* product)

Engineering
Literature:
- need for continuous engineering,
- necessity for ad-hoc development
Freelancers:
- engineering is ad-hoc and results from short-term demand driven, e.g., by hype around GenAI (fashion)

Lack of Adequate Process Models
Literature:
- need for a highly iterative life-cycle model,
- lack of guidance on various aspects,
- lack of well-defined, widely accepted process model,
- lack of processes for annotation
Freelancers:
FL did not notice specific challenges concerning SE process models as they have engaged in ad-hoc development before. However, they agree that they could benefit from more precise, systematic guidance on how to deal with the new paradigm of GenAI.

Customer Communication
Literature:
- hard to negotiate unfeasible expectations,
- hard to convince clients to pay continuously for improvements,
- need for skills to help customers set feasible targets / requirements
Freelancers:
- common ground with clients from outside IT context is difficult due to hype and inflated expectations around GenAI (fashion)
- communication with clients is difficult because clients rely on misinformation and inflated promises regarding GenAI (fashion)

End-user Engagement
Literature:
- clients focus on technical solutions rather than involving end-users,
- skepticism of users who were involved not early enough
Freelancers:
- clients without experience in IT projects refuse to participate in requirements elicitation or provide requirements which are not useful (fashion)
- hype-driven clients expect fast results neglecting the need to involve end users in design (fashion)

Explanations
Literature:
- difficult to explain to clients why models produce certain output,
- insufficient explainability for some model types,
- explaining the potential for improvement over time
Freelancers:
- performance of LLMs and GenAI models difficult to explain to clients due to lack of knowledge about how models work (^ paradigm)
- opaqueness and inherent variability of the models make it difficult to understand output and make reliable statements (paradigm)
- clients blame freelancers for low performance of the models (^ paradigm)

Societal Concerns
Literature:
- need for continuous chain of human responsibility,
- ethical considerations,
- proliferation of fake content based on generation,
- speed of technical progress higher than regulation
Freelancers:
FL acknowledge that AI technologies have societal implications and try to implement guardrails to prevent them (see SW Quality).
FL see the risk of job losses in the society, yet do not consider it a challenge for their own development activity.



Team Composition and Collaboration
Literature:
- variety of disciplines and backgrounds involved in AI projects,
- overlaps between data scientists and engineers,
- slow start to onboard non-engineers
Freelancers:
FL work alone or in small teams of FL (when a freelancing agency) and did not mention problems concerning team composition or collaboration.

Workforce Skills
Literature:
- need for diverse skills,
- lack of diverse skills in teams,
- lack of adequately educated engineers
Freelancers:
- skilled freelancers are lacking due to increased hype-driven demand for their skills and experience (* fashion)
- freelancers without adequate experience join the market due to enhanced demand and potential for lucrative jobs (* fashion)

Community and Competition
Literature:
the literature does not explicitly mention community-related topics as challenges
Freelancers:
- staying on top of dynamic market changes and technical developments causes major effort to keep pace with it (* fashion)
- community to exchange with is difficult to find, especially on use of LLMs and GenAI for specific domains, due to novelty and niche character of some projects (* novelty)
- competition increases due to many new freelancers entering the market (* fashion)


SE Management

Planning Uncertainty
Literature:
- uncertainty of estimating development time and cost,
- no well-defined input-output factor for validation,
- assessment of long-term potential based on short-term metrics,
- impossibility to make guarantees on cost-effectiveness
Freelancers:
- estimations are difficult to make due to lack of experience (^ novelty)
- limited ability to say upfront if a specific case can be covered by GenAI technologies due to lack of experience (^ novelty)
- clients expect that a certain task will be solved with GenAI despite more adequate and easier-to-plan alternatives, e.g., conventional ML (fashion, paradigm)
- reasonable offer / putting a price tag on a project difficult due to lack of heuristics and experience (* novelty)

Regulatory Demands
Literature:
- amount of reviews, software updates, cycles of data collection or annotation demanded by law
Freelancers:
FL did not refer to regulatory aspect as a challenge for them.

Resources Shortage and Costs
Literature:
- resource limits curbing down efforts,
- lack of organizational incentives,
- lack of resources
Freelancers:
- costs related to experimentation during prompt-engineering are hard to manage and control especially if the client does not cover those costs (product)

Assessment of Gains
Literature:
- quantification and assessment of the knowledge generated during the development of the system
Freelancers:
FL did not refer to this aspect as a challenge for them, they mention that freelancing allows for learning, and they see each project as learning opportunity.

Pre-Selection Effort
Literature:
the literature does not explicitly mention pre-selection as challenges
Freelancers:
- preselecting projects is hard based on the requests from clients due to lack of experience (* novelty)
- fashionableness causes many unusual or incomplete requests and assessing them requires effort to explore whether GenAI technologies are useful to solve the requests
- frequent changes and instability of the models might invalidate preselection decisions (product)

Data Access
Literature:
- hard to discover what data exists and where it is,
- difficulty to collect new data and prepare it

Data Management
Literature:
- difficult controlling, versioning, deploying of data

Data Preparation
Literature:
- need to combine, transfer, clean, and transform data,
- lack of tools for data preparation,
- inconsistent data types and quality

Freelancers (Data Access, Data Management, Data Preparation):
FL refer to challenges related to data if their projects involve (a) training of fine-tuning models or (b) retrieval augmented generation; in both cases the challenges confirm many of the challenges listed for conventional ML, larger datasets are necessary for training a generative model. (paradigm)

Data Quality
Literature:
- difficult to assure data quality,
- missing values in data,
- missing rare cases in data,
- lack of tools to annotate and assure data quality,
- bias in data
Freelancers:
- no need for training data and training data quality assurance due to the availability of pre-trained models (product)
2 Method


2.1 Participants Acquisition 


2.1.1 Job Description Posted to Upwork 


I am looking for developers who create software applications involving intelligent features based on
generative AI (including any of the recent large language models): it might involve web apps, mobile
apps, plug-ins or packages towards bigger projects. I want to learn about problems and challenges
you encounter in your daily work with GPT, LLaMA, LaMDA, and others, their APIs, or your own
generative models, how you solve those challenges, and what would you wish for in the future.

I do not need any sensitive information (employer, name of the project/app, country, private
addresses, etc.), but some basic information about the functionality might be necessary for me to
understand the use case.


These are the selection criteria for interview participants: 

- software developer or software project manager with 2 years of experience or more, 

- participated in 2 or more development projects involving the use of generative AI platforms / APIs,

- the developed software uses, processes, or interprets the output from the API (i.e., at least one of
your projects/apps should be more than simply an adapted/extended version of the OpenAI's
playground).


Your tasks involve: 

- filling out a short survey (5-10 min) prior to the interview including an informed consent for 
participation, 

- participating in an online interview via zoom (approx. 45-50 min). 


Each participant will be paid 35 USD after the completion of the interview. 


The dates for the interview will be arranged individually. The interviews will be audio-recorded. Later 
on, a research team will anonymize, transcribe, and analyse them. 


If you think you're the right participant for the interview, drop me a line. 


2.1.2 Questions Asked in the Proposal Phase
1. What generative AI platforms or APIs did you use in your past projects?
2. How many years of experience as software developer do you have?
3. What is your highest training / education degree?
4. How many from your recent projects used a generative AI platform or API?


2.1.3 Conflict of Interests Declaration 


The authors, unaffiliated with Upwork Inc., did not receive any benefits from the company or its
subsidiaries at any time. The choice of the platform was based on specific criteria: (1) the ability to
post a job with a fixed rate, to ensure equal pay for all freelancers, (2) detailed search criteria,
including GenAI-related topics, (3) a long-standing presence with experienced freelancers, (4) a clear,
consistent contract price, to fit the University’s study participant reimbursement rules. We consulted multiple online
resources, e.g., arc.dev/employer-blog/toptal-upwork-fivert-arc/, www.g2.com/categories/freelance- 
platforms?attributes[22 1]=420#grid 
2.1.4 Participants List


This list presents all participants of the study with their random codes, demographic data, and data
related to developer experience and specific experience with GenAI models / APIs.
aa 25 3 3 HND Website Development Nigeria X
ab* 38 5 2 Master Economics Serbia X X
ac 30 5 3 Bachelor Material Engineering Egypt X X X X X X
ad* 30 4 10 Master Machine Learning United States X X X
ae 30 2 4 Bachelor History Montenegro X X X
af 34 4 2 Bachelor Business Administration Germany X X
ag 23 5 5 Bachelor Software Engineering India X X
ah 30 4 6 Bachelor Electronic Engineering India X X X X
ai 33 4 5 Master Software Engineering United States X X X X
bj 24 7 3 Master Computer Science Serbia X X X
bk 24 7 4 Master Computer Science India X X X
cl* 55 2 15 Bachelor Computer Science United States X X X X
cm 40 6 15 Master Software Engineering United States X X X
dn 23 4 5 Bachelor Geography Nigeria X X X X X
do* 31 5 5 Master Computer Science Kenya X X X
dp 22 3 5 Student Computer Science Poland X X
er 43 3 7 Bachelor Computer Science Turkey X X X X X
es 37 4 10 Bachelor Electronic Engineering Indonesia X X X
et 38 7 15 Bachelor Computer Science United States X X
eu* 35 4 15 Master IT Management Serbia X X X
ev* 38 6 13 Master Computer Science France X X X
gw 36 11 10 PhD Machine Learning Netherlands X X
gx 39 7 20 Master Human Computer Interaction United States X X X
hy 29 5 Bachelor Computer Science Pakistan X X
hz 35 3 10 Master Software Engineering France X
ia* 28 3 5 PhD Physics Romania X X
jb 31 4 7 Bachelor Computer Science Portugal X X X
jc 24 5 2 Bachelor Information Systems United States X X
jd 42 5 20 Bachelor Computer Science United Kingdom X X X X X
je 42 3 10 PhD Physics United Kingdom X
jf* 23 5 5 Bachelor Computer Science Romania X
g 33 10 10 Master Physics Sweden X X X X X
h 32 5 7 Bachelor Computer Science India X X
ki 30 2 1 Bachelor Electronic Engineering Sri Lanka X
lj 52 6 10 PhD Telecommunications France X
mk 35 2 3 Master Communication New Zealand X
ml* 30 4 5 PhD Machine Learning Pakistan X X X X X
om 23 5 3 Student Geography Nigeria X X X X
pn 38 2 2 Master Telecommunications Montenegro X
ro 36 15 6 Master Machine Learning India X X X X X
rp 31 3 3 Bachelor Commerce Dominican Republic X X X
sr 25 4 2 Bachelor Computer Science Pakistan X
ss* 30 5 5 Master Mathematics Pakistan X X X
st 23 6 3 Bachelor Computer Science Nigeria X X X X
su 34 4 9 Master Computer Science Serbia X X X X X
tv 24 5 7 Master Computer Science Pakistan X X
tw 27 3 2 Bachelor Economics Germany X
tx* 30 3 6 Master Control & Automation Engineering Turkey X X
ty 51 8 15 PhD Natural language processing Finland X
uz 30 5 2 Master Electronic Engineering Pakistan X X X X
va* 19 3 3 Student Computer Science Ukraine X X X X
wb 49 3 8 Master Psychology Portugal X
SUM 52 | 21 | 18 | 17 | 13 | 6 | 5 | 11
AVG 32.6 4.8 6.9 
MED 31 4 5 
2A


This list provides more information about the participants quoted in the manuscript. 


Quoted Participants 


experience concerning GenAI platforms (SD - Stable Diffusion, custom - custom self-trained models)
- Two exemplary use cases for which the interviewee developed solutions

ab | 38 | 5 | 2 | MSc | Economics | Serbia | GPT, LLaMA, Vicuna
- Classifying SaaS websites with GPT and Vicuna
- Investment and trading prediction using GPT and LLaMA

ad | 30 | 4 | 10 | MSc | ML/AI | USA | GPT, LaMDA, Others
- AI sales assistant based on fine-tuned GPT
- Recommendation chatbot for retail based on fine-tuned GPT

cl | 55 | 2 | 15 | BSc | CS | USA | GPT, MJ, PaLM, SD
- Generative shoe design with text-to-image models
- Designing Kubernetes cloud environment using GPT and Bard/LaMDA

do | 31 | 5 | 5 | MSc | CS | Kenya | DE, GPT, LLaMA
- Text summarization of demand letters in financial industry using GPT
- API for Q&A bots for text-to-text and table-to-text scenarios

eu | 35 | 4 | 15 | MSc | IT Mngmt. | Serbia | GPT, LLaMA, Vicuna
- Q&A chatbot for local knowledge base based on GPT
- Collection of news on a topic and providing summary with GPT

ev | 38 | 6 | 13 | MSc | CS | France | GPT, MJ, Others
- Customer support chatbot for home appliances based on GPT
- Integrating Midjourney output in an AR application in luxury industry

ia | 28 | 3 | 5 | PhD | Physics | Romania | GPT, MJ
- Summarization of market research reports based on GPT
- Summarization of output obtained from an SQL database using GPT

jf | 23 | 5 | 5 | BSc | CS | Romania | GPT
- Customer support assistant using dynamically scraped content, uses GPT
- Q&A chatbot as a medical assistant based on GPT

ml | 30 | 4 | 5 | PhD | ML/AI | Pakistan | GPT, LLaMA, SD, LaMDA
- Customized medical prescription recommendation based on GPT
- Modules for search and Q&A for regulatory documents based on GPT

ss | 30 | 5 | 5 | MSc | Math. | Pakistan | DE, GPT, SD, custom
- Generation of videos with avatars reading input text, with various models
- Mobile guidance app using video input and output with custom model

tx | 30 | 3 | 6 | MSc | CA | Turkey | DE, GPT
- Generation of Game Design Documents for mobile games based on GPT
- Instant game content generation by player with voice based on GPT

va | 19 | 3 | 3 | Stud. | CS | Ukraine | DE, GPT, SD, Others
- Teaching assistant Q&A-bot providing sources, images, and summaries
- Generating a multi-perspective, multi-opinion newsletter with GPT
2.2 Data Collection


2.2.1 Survey Questions 


The survey was hosted on a LimeSurvey server of the university. Participants received an invitation to the
survey after accepting the contract on Upwork. All communication with the participants took place through
Upwork.


[Informed Consent] 


1. Which of the following tools did you use in the apps you developed? 
AlphaFold 
BLOOM 
Chinchilla AI
DALL-E 

GPT / ChatGPT 
LaMDA / Bard
LLaMA 
Midjourney 
PaLM 

Stable Diffusion 
Other: [...] 




2. In what projects and for what did you use the tools? (open question) 
Provide two or more examples. A brief description is enough, e.g., "Connected to GPT to create 
ideas on a given topic in a brainstorming app." or "Used Stable Diffusion to generate graphical 
summaries of daily news for a news app." If you want, feel free to paste external links. 
Please, give preference to projects or usage scenarios which challenged you most. 

3. What is the biggest challenge when creating apps with generative AI capabilities? (open
question)

4. What is the biggest reward from integrating generative AI capabilities in apps? (open question)


2.2.2 Interview Guide 

After the survey, participants were provided access to a calendar in which they could book an 
appointment that suits their time schedule. After booking the appointment, they received access to the 
interview guide so they could see it before the interview (but they were not requested to do so). 


Interview “Developing Apps that use Generative AI”


The interview will be conducted in a semi-structured manner. It means, I will try to address most of
the topics listed below. However, we might change the order of the questions depending on how the 
conversation develops. Some of the questions might not perfectly fit your specific situation or 
experience. Do not worry. We will figure out how to go about it. 


Part 1: Opening 
Short Introduction 

- Goal of the study 
- My background 

- Your background 


Part 2: Example Projects 
Which project based on generative AI are you most proud of? Why?
Were there any projects involving generative AI that did not work out so well? (optional)


Is there any other project you want to tell me about? (optional)


Part 3: Comparison to Projects which don’t Involve Generative AI
What has surprised you most about projects involving generative AI so far?
How are generative AI projects different from / similar to earlier projects?
What is the biggest uncertainty in generative AI projects? How do you deal with them?


Part 4: General Attitude 
In the survey you mentioned the following challenge: ... (for each challenge listed in the survey) 
Can you explain to me what exactly you mean by this? Why is this a challenge? 

What does make the most fun to you when you are developing software based on generative AI?
What special skills does a person need to develop software based on generative AI?

What are the positive and negative implications of generative AI for you as the freelancer?


2.3 Data Analysis 


2.3.1 Interviews - Initial Coding 
(coder: 2nd author, approach: bottom up, goal: identify passages pointing to challenges) 


In the early phase of analysis, the coder identified challenges from four sensitizing perspectives: ethics,
society, emotions, and technology. These perspectives were inspired by the coder’s professional background
(independent consultant and coach for teams and individuals in the IT industry). The coding was supervised
by both other authors, and edge cases were discussed among them to assure reliability. We provide those
codes here for transparency, but we want to stress that this differentiation was not used further during
the analysis. Instead, the authors regrouped all 268 segments as described below. Here are the four
categories used in the first round of coding along with the number of segments that were subsumed
under those codes (numbers do not sum up to 268 as some passages received more than one code):

- ethical challenges (21 segments), 

- social challenges (23 segments), 

- emotional challenges (86 segments), 

- technical challenges (141 segments). 

- OVERALL: 268 segments including challenges from the freelancers’ perspective. 
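
The remark that the counts do not sum up to 268 can be checked with a few lines of arithmetic. The following is a minimal sketch in Python; the variable names are illustrative assumptions, the numbers are the ones reported in the list above, and the interpretation assumes that every segment received at least one of the four codes.

```python
# Segment counts per first-round code, as reported in the list above.
first_round_counts = {
    "ethical": 21,
    "social": 23,
    "emotional": 86,
    "technical": 141,
}

# Overall number of distinct segments reported above.
unique_segments = 268

# Total number of code assignments across all four categories.
total_code_assignments = sum(first_round_counts.values())

# Surplus of assignments over distinct segments; it is non-zero
# because some passages received more than one code.
extra_assignments = total_code_assignments - unique_segments

print(total_code_assignments)  # 271
print(extra_assignments)       # 3
```

Under the stated assumption, the surplus of three assignments means that between one and three segments carried more than one first-round code.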


2.3.2 Interviews - Workshops and Sorting according to SWEBOK


After the identification of 268 segments describing freelancers’ perspective on the challenges related to
GenAI, the authors conducted two interpretation workshops of two hours each in which they discussed the
meaning of the challenges and potential ways to make sense of them. They were provided with the selected
passages prior to the first workshop to become familiar with the content of the data. The workshops were
moderated by the first author and involved various deductive and inductive sorting approaches. Finally,
inspired by the relevant literature in the field, the authors settled on classifying the segments according
to SWEBOK Knowledge Areas (extended by the KA Training and Testing Data) to reflect approaches
implemented in meta-reviews, which assured the best comparability to existing literature. We used the
official SWEBOK descriptions and the categorizations from existing meta-reviews [1, 2] to guide
the categorization, which was conducted collaboratively.


Knowledge Area (number of coded segments) | Description (based on SWEBOK) | Example


Software Requirements


The Software Requirements KA is concerned with the elicitation, negotiation, analysis,
specification, and validation of software requirements.


Yes, I think I think the main problem is really, for example, when when you talk with a client about
the success criteria, they expect, for example, to have quantified accuracy metrics of that specific
use case. (ad)


Software Design
(31)


Design is defined as both the process of defining the architecture, components, interfaces, and
other characteristics of a system or component and the result of that process.


I think the challenge is understanding the LangChain and how it works. It's important to know that
you just can't dump all the data and expect, you know, a large language model or the app to extract
it. It needs to be converted. The data needs to be first converted into small size chunks.
Understanding chunk overlap is an important concept. (...) Then converting those text into
embeddings, it's an important concept. (...) You cannot store text as it is in a SQL database. (...)
There are a lot of new technologies that need to be that you need to really clear about before
building an app. (ai)


Software
Construction 
(63) 


This KA covers software construction 
fundamentals; managing software 
construction; construction technologies; 
practical considerations; and software 
construction tools. 


The biggest challenge is getting the right prompt to use, because 
sometimes even after you finish, you think you are done working. 
Then you input the API keys or the prompts at the end of the day. It 
doesn't work that you have to like keep trying and keep trying until 
you get the perfect prompts you want. (dn) 


Software Testing 
(12) 


The Software Testing KA includes the 
fundamentals of software testing; testing 
techniques; human-computer user interface 
testing and evaluation; test-related 
measures; and practical considerations. 


You cannot simply build a predictor for X, and then the customer 
can just say: ‘you are 10% off the goal, so, it's not working, sorry’. 
With generative AI, it's so subjective and it is very difficult to have a
quantitative evaluation for a project and it's hard to to understand 
for the customers as well how to how to do that. (gw) 


Software Quality 
(12) 


Software Quality KA includes fundamentals 
of software quality; software quality 
management processes; and practical 
considerations. 


I think the feature that we are implementing will depend on the
directive in the world. I mean regulation. (...) I think these
technologies improve very fast. But for many countries, this
regulation can't be produced in that top speed. (...) So, I think we
should think about that regulation. (ki)


Software 
Maintenance 
(14) 


The Software Maintenance KA includes 
fundamentals of software maintenance; key 
issues in software maintenance; the 
maintenance process; software maintenance 
techniques; disaster recovery techniques, 
and software maintenance tools. 


I would say that I am really a bit anxious about it, because like
tomorrow OpenAI will change something and from that moment on
[the app] will not work. (...) I could say ‘Client, that's okay, you are on
your own. I don't care about what will happen afterwards’. My policy
is that I give guarantee that if I miss something, I will be back. I will fix
it because it's my bad. (va)


Software Configuration Management
(25)


The Software Configuration Management KA covers management of the SCM process; software
configuration identification, control, status accounting, and auditing; software release
management and delivery; and software configuration management tools.


So when we're talking about GPT, it's a lot of people want to fine tune the model (...) Because if
you do understand what it means, then you know, the job that was $500 is now going to be $5,000
because you have to run an instance. You have to run a whole backend server and all of those
things. You can't just make use of APIs and things like that anymore. (rp)


Software Engineering Process, Models and Methods
(3)


Topics covered include process implementation and change; process definition; process assessment
models and methods; measurement; and software process tools. Further, they also involve
modeling; types of models; analysis; and software development methods.


Firstly, we have to define our requirement what kind of output we want to extract. Once we define
our requirement, we can use OpenAI playground to try. Once, we get our output, we can let it go.
Once we get at a better point, we can apply it with OpenAI API. So that is the process. (ki)


Training and Testing Data
(12)


challenges related to generation, management, and use of training and test datasets for ML.


Software Engineering Professional Practice
(41)


The Software Engineering Professional Practice KA covers professionalism; codes of ethics; group
dynamics; and communication skills.


So, I would say that the biggest challenge was that it's a new industry and it's developed very
fast. And some articles that you read today, are not working anymore tomorrow. So yeah, it's
changing very, very fast and you don't really know like what will work and what will not until you
try. And they are not that much information in general because there are no such stable
communities. (va)


The Software Engineering Management KA I'm definitely safer in this approach where I have to be very I mean, | 
Software covers initiation and scope definition; have to be very sure if I take on the project, so I can deliver great 
Engineering software project planning; software project | results for the client, and that client will be happy with the results. | 
Management enactment; product acceptance; review and | think this definitely has changed, | mean, the fact that I take on less 
(14) analysis of project performance; project projects regarding generative Al, because | like to be, | like to be 
closure; and software management tools. comfortable. (dp) 
We derived this KA from earlier meta-reviews | The quality of data varies from plan to plan and from project to 
Training and concerning SE4AI / SE4ML. It includes project based on what data they have and what they are trying to 


make of that data. And many of the times the model itself is able to 
handle some bad quality data like some missing data. The model 
itself are able to handle that. (ag) 


One intermediate result of the workshops was the identification of factors that interviewees named as 
contributing to the challenges when they explained where a challenge came from. The four factors were 
hype (fashionableness of GenAI), hype (novelty of GenAI), technology (characteristics of the GenAI 
paradigm), and technology (characteristics of the GenAI product). We used these four categories as an 
interpretative sensemaking frame to establish a shared understanding among the authors, following an 
interpretative research approach [3, 6, 7], and used them to structure the results. 